Semisupervised methods are techniques for using labeled data
$(X_1, Y_1), \ldots, (X_n, Y_n)$ together with unlabeled data $X_{n+1}, \ldots, X_N$
to make predictions. These methods invoke some assumption that
links the marginal distribution $P_X$ of $X$ to the regression function
$f(x)$. For example, it is common to assume that $f$ is very smooth
over high density regions of $P_X$. Many of the methods are ad hoc
and have been shown to work in specific examples but lack a
theoretical foundation. We provide a minimax framework for analyzing
semisupervised methods. In particular, we study methods based
on metrics that are sensitive to the distribution $P_X$. Our model
includes a parameter $\alpha$ that controls the strength of the semisupervised
assumption. We then use the data to adapt to $\alpha$.
Fig 1. The covariate $X = (X_1, X_2)$ is two-dimensional. The response $Y$ is binary and is
shown as a square or a circle. Left: The labeled data. Right: Labeled and unlabeled data.



1. Introduction. Suppose we have data $(X_1, Y_1), \ldots, (X_n, Y_n)$ from a
distribution $P$, where $X_i \in \mathbb{R}^d$ and $Y_i \in \mathbb{R}$. Further, we have a second set
of data $X_{n+1}, \ldots, X_N$ from the same distribution but without the $Y$'s. We
refer to $\mathcal{L} = \{(X_i, Y_i) : i = 1, \ldots, n\}$ as the labeled data and
$\mathcal{U} = \{X_i : i = n+1, \ldots, N\}$ as the unlabeled data. There has been a major effort,
mostly in the machine learning literature, to find ways to use the unlabeled
data together with the labeled data to construct good predictors of $Y$. These
methods are known as semisupervised methods. It is generally assumed that
the $m = N - n$ unobserved labels $Y_{n+1}, \ldots, Y_N$ are missing completely
at random and we shall assume this throughout.

To motivate semisupervised inference, consider the following example. We
download a large number $N$ of webpages $X_i$. We select a small subset of size
$n$ and label these with some attribute $Y_i$. The downloading process is cheap
whereas the labeling process is expensive, so typically $N$ is huge while $n$ is
much smaller.

Figure 1 shows a toy example of how unlabeled data can help with prediction.
In this case, $Y$ is binary, $X \in \mathbb{R}^2$ and we want to find the decision
boundary $\{x : P(Y = 1 | X = x) = 1/2\}$. The left plot shows a few labeled
data points from which it would be challenging to find the boundary. The
right plot shows labeled and unlabeled points. The unlabeled data show that
there are two clusters. If we make the seemingly reasonable assumption that
$f(x) = P(Y = 1 | X = x)$ is very smooth over the two clusters, then identifying
the decision boundary becomes much easier. In other words, if we
assume some link between $P_X$ and $f$, then we can use the unlabeled data;
see Figure 2.

The assumption that the regression function $f(x) = E(Y | X = x)$ is very
smooth over the clusters is known as the cluster assumption. In the special
case where the clusters are low dimensional submanifolds, the assumption
is called the manifold assumption. These assumptions link the regression
function $f$ to the distribution $P_X$ of $X$.

Many semisupervised methods are developed based on the above assumptions,
although this is not always made explicit. Even with such a link,
it is not obvious that semisupervised methods will outperform supervised
methods. Making precise how and when these assumptions actually improve
inferences is surprisingly elusive and most papers do not address this issue;
some exceptions are Rigollet (2007), Singh, Nowak and Zhu (2008a), Lafferty
and Wasserman (2007), Nadler, Srebro and Zhou (2009), Ben-David,
Lu and Pal (2008), Sinha and Belkin (2009), Belkin and Niyogi (2004) and
Niyogi (2008). These authors have shown that the degree to which unlabeled
data improves performance is very sensitive to the cluster and manifold
assumptions. In this paper, we introduce adaptive semisupervised inference.
We define a parameter $\alpha$ that controls the sensitivity of the distance metric
to the density, and hence the strength of the semisupervised assumption.
When $\alpha = 0$ there is no semisupervised assumption, that is, there is no
link between $f$ and $P_X$. When $\alpha = \infty$ there is a very strong semisupervised
assumption. We use the data to estimate $\alpha$ and hence we adapt to the
appropriate assumption linking $f$ and $P_X$.

This paper makes the following contributions: 

1. We formalize the link between the regression function $f$ and the marginal
distribution $P_X$ by defining a class of function spaces based on a metric
that depends on $P_X$. This is called a density-sensitive metric.

2. We show how to consistently estimate the density-sensitive metric.

3. We propose a semisupervised kernel estimator based on the density-sensitive
metric.

4. We provide some minimax bounds and show that under some conditions
the semisupervised method has smaller predictive risk than any
supervised method.

5. The function classes depend on a parameter $\alpha$ that controls how strong
the semisupervised assumption is. We show that it is possible to adapt
to $\alpha$.

6. We provide numerical simulations to support the theory. 
Fig 2. Supervised learning (left) uses only the labeled data $\mathcal{L}$. Semisupervised learning
(right) uses the unlabeled data $\mathcal{U}$ to estimate the marginal distribution $P_X$ which helps
estimate $f$ if there is some link between $P_X$ and $f$. This link is the semisupervised (SS)
assumption.

In addition, we should add that we focus on regression while most previous 
literature only deals with binary outcomes (classification). 

Related Work. There are a number of papers that discuss conditions under
which semisupervised methods can succeed or that discuss metrics that
are useful for semisupervised methods. These include Bousquet, Chapelle
and Hein (2004), Singh, Nowak and Zhu (2008a), Lafferty and Wasserman
(2007), Sinha and Belkin (2009), Ben-David, Lu and Pal (2008), Nadler, Srebro
and Zhou (2009), Sajama and Orlitsky (2005), Bijral, Ratliff and Srebro
(2011), Belkin and Niyogi (2004), Niyogi (2008) and references therein. Papers
on semisupervised inference in the statistics literature are rare; some
exceptions include Culp and Michailidis (2008), Culp (2011a) and Liang,
Mukherjee and West (2007). To the best of our knowledge, there are no
papers that explicitly study adaptive methods that allow the data to choose
the strength of the semisupervised assumption.

There is a connection between our work and the semisupervised classification
method in Rigollet (2007). He divides the covariate space $\mathcal{X}$ into
clusters $C_1, \ldots, C_k$ defined by the upper level sets $\{p_X > \lambda\}$ of the density
$p_X$ of $P_X$. He assumes that the indicator function $I(x) = I(p(y|x) > 1/2)$ is
constant over each cluster $C_j$. In our regression framework, we could similarly
assume that

$$f(x) = \sum_{j=1}^{k} f_{\theta_j}(x)\, I(x \in C_j) + g(x)\, I(x \in C_0)$$

where $f_\theta(x)$ is a parametric regression function, $g$ is a smooth (but nonparametric)
function and $C_0 = \mathcal{X} - \bigcup_{j=1}^{k} C_j$. This yields parametric, dimension-free
rates over $\mathcal{X} - C_0$. However, this creates a rather unnatural and harsh
boundary at $\{x : p_X(x) = \lambda\}$. Our approach may be seen as a smoother
version of this idea.
Outline. This paper is organized as follows. In Section 2 we give definitions
and assumptions. In Section 3 we define density-sensitive metrics and the
function spaces defined by these metrics. In Section 4 we present results on
estimating density-sensitive metrics. In Section 5 we define a density-sensitive
semisupervised estimator and we bound its risk. In Section 6 we present some
minimax results. We discuss adaptation in Section 8. We provide simulations
in Section 9. Section 10 contains closing discussion. Additional proofs are contained
in Section 11.

2. Definitions. Recall that $X_i \in \mathbb{R}^d$ and $Y_i \in \mathbb{R}$. Let

(1)  $\mathcal{L}_n = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$

be an iid sample from $P$. Let $P_X$ denote the $X$-marginal of $P$ and let

(2)  $\mathcal{U}_N = \{X_{n+1}, \ldots, X_N\}$

be an iid sample from $P_X$.

Let $f(x) = f_P(x) = E(Y | X = x)$. An estimator of $f$ that is a function of
$\mathcal{L}_n$ is called a supervised learner and the set of such estimators is denoted
by $\mathcal{S}_n$. An estimator that is a function of $\mathcal{L}_n \cup \mathcal{U}_N$ is called a semisupervised
learner and the set of such estimators is denoted by $\mathcal{SS}_N$. Define the risk of
an estimator $\hat f$ by

(3)  $R_P(\hat f) = E_P\left[\int (\hat f(x) - f_P(x))^2 \, dP(x)\right]$.

Of course, $\mathcal{S}_n \subset \mathcal{SS}_N$ and hence,

$$\inf_{g \in \mathcal{SS}_N} \sup_{P \in \mathcal{P}} R_P(g) \le \inf_{g \in \mathcal{S}_n} \sup_{P \in \mathcal{P}} R_P(g).$$

We will show that, under certain conditions, semisupervised methods outperform
supervised methods in the sense that the left hand side of the above
inequality is substantially smaller than the right hand side. More precisely,
for certain classes of distributions $\mathcal{P}_n$, we show that

$$\frac{\inf_{g \in \mathcal{SS}_N} \sup_{P \in \mathcal{P}_n} R_P(g)}{\inf_{g \in \mathcal{S}_n} \sup_{P \in \mathcal{P}_n} R_P(g)} \to 0$$

as $n \to \infty$. In this case we say that semisupervised learning is effective.



Remark: In order for the asymptotic analysis to reflect the behavior of
finite samples, we need to let $\mathcal{P}_n$ change with $n$ and we need $N = N(n) \to \infty$
and $n/N(n) \to 0$ as $n \to \infty$. As an analogy, one needs to let the number of
covariates in a regression problem increase with the sample size, to develop
relevant asymptotics for high dimensional regression. Moreover, $\mathcal{P}_n$ must
have distributions that get more concentrated as $n$ increases. The reason
is that, if $n$ is very large and $P_X$ is smooth, then there is no advantage to
semisupervised inference. This is consistent with the finding in Ben-David,
Lu and Pal (2008) who show that if $P_X$ is smooth, then "... knowledge
of that distribution cannot improve the labeled sample complexity by more
than a constant factor."



Other Notation. If $A$ is a set and $\delta \ge 0$ we define

$$A \oplus \delta = \bigcup_{x \in A} B(x, \delta)$$

where $B(x, \delta)$ denotes a ball of radius $\delta$ centered at $x$. Given a set $A \subset \mathbb{R}^d$,
define $d_A(x_1, x_2)$ to be the length of the shortest path in $A$ connecting $x_1$
and $x_2$.

We write $a_n = O(b_n)$ if $|a_n/b_n|$ is bounded for all large $n$. Similarly,
$a_n = \Omega(b_n)$ if $|a_n/b_n|$ is bounded away from 0 for all large $n$. We write
$a_n \asymp b_n$ if $a_n = O(b_n)$ and $a_n = \Omega(b_n)$. We also write $a_n \preceq b_n$ if there exists
$C > 0$ such that $a_n \le C b_n$ for all large $n$. Define $a_n \succeq b_n$ similarly. We
use symbols of the form $c, c_1, c_2, \ldots, C, C_1, C_2, \ldots$ to denote generic positive
constants whose value can change in different expressions.

To prove lower bounds, we will use Assouad's Lemma (see Lemma 24.3
in van der Vaart (1998)). Recall that the Hamming distance between two
vectors $v$ and $w$ is $\rho(v, w) = \sum_j I(v_j \ne w_j)$.

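Lemma 1 (Assouad's Lemma) Let $\Omega = \{0, 1\}^q$ be the collection of binary
vectors of length $q \ge 1$. Let $\mathcal{P}_\Omega = \{P_\omega : \omega \in \Omega\}$ be a collection of $2^q$
probability measures indexed by $\omega \in \Omega$. Also let

$$\|P_\omega \wedge P_\nu\| = 1 - \sup_A |P_\omega(A) - P_\nu(A)|$$

denote the affinity between two distributions, where the supremum is over
all measurable sets $A$. Let $\{f_\omega : \omega \in \Omega\}$ be a collection of functions. For
any semi-distance $d$, and any $p > 0$,

(5)  $\inf_{\hat f} \max_{\omega \in \Omega} E_\omega\left[d^p(f_\omega, \hat f)\right] \ge \frac{q}{2^{p+1}} \left( \min_{\omega, \nu :\, \rho(\omega,\nu) \ne 0} \frac{d^p(f_\omega, f_\nu)}{\rho(\omega, \nu)} \right) \left( \min_{\omega, \nu :\, \rho(\omega,\nu) = 1} \|P_\omega \wedge P_\nu\| \right)$.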
3. Density-Sensitive Metrics. We allow the marginal distribution
$P_X$ for $X$ to be arbitrary. We define a smoothed version of $P_X$ as follows.
Let $K$ denote a symmetric kernel on $\mathbb{R}^d$ with compact support, let $\sigma > 0$
and define

$$p_\sigma(x) \equiv p_{X,\sigma}(x) = \int \frac{1}{\sigma^d} K\left(\frac{\|x - u\|}{\sigma}\right) dP_X(u).$$

Thus, $p_{X,\sigma}$ is the density of the convolution $P_{X,\sigma} = P_X \star K_\sigma$ where $K_\sigma$ is the
measure with density $K_\sigma(\cdot) = \sigma^{-d} K(\cdot/\sigma)$. $P_{X,\sigma}$ always has a density even
if $P_X$ does not. This is important because, in high dimensional problems,
it is not uncommon to find that $P_X$ is highly concentrated near a low
dimensional manifold. These are precisely the cases where semisupervised
methods are often useful (Ben-David, Lu and Pal (2008)). Indeed, this
was one of the original motivations for semisupervised inference. We define
$P_{X,0} = P_X$. For notational simplicity, we shall sometimes drop the $X$ and
simply write $p_\sigma$ instead of $p_{X,\sigma}$.

3.1. The Exponential Metric. Following previous work in the area, we
will assume that the regression function is smooth in regions where $P_X$ puts
lots of mass. To make this precise, we define a density-sensitive metric as
follows. For any pair $x_1$ and $x_2$ let $\Gamma(x_1, x_2)$ denote the set of all continuous
finite curves from $x_1$ to $x_2$ with unit speed everywhere and let $L(\gamma)$ be the
length of curve $\gamma$; hence $\gamma(L(\gamma)) = x_2$. For any $\alpha \ge 0$ define the exponential
metric

$$D(x_1, x_2) \equiv D_{P,\alpha,\sigma}(x_1, x_2) = \inf_{\gamma \in \Gamma(x_1, x_2)} \int_0^{L(\gamma)} \exp\left[-\alpha\, p_{X,\sigma}(\gamma(t))\right] dt.$$

In Section 7 we also consider a second metric, the reciprocal metric. Large
$\alpha$ makes points connected by high density paths closer; see Figure 3. Note
that $\alpha = 0$ corresponds to Euclidean distance. Similar definitions are used in
Sajama and Orlitsky (2005), Bijral, Ratliff and Srebro (2011) and Bousquet,
Chapelle and Hein (2004).
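
As a rough numerical illustration of this definition, the sketch below evaluates the path integral of $\exp[-\alpha p(\gamma(t))]$ along one straight segment using the midpoint rule. The function name `exp_path_length`, the constant density `plateau`, and the step count are illustrative assumptions, not part of the paper. Because the metric is an infimum over all paths, the value along any single path is only an upper bound on the distance.

```python
import math

def exp_path_length(x1, x2, density, alpha, steps=1000):
    """Midpoint-rule approximation of the integral of exp(-alpha * p)
    along the straight segment from x1 to x2 (an upper bound on the metric)."""
    length = math.dist(x1, x2)
    dt = length / steps
    total = 0.0
    for k in range(steps):
        s = (k + 0.5) / steps  # fraction of the way along the segment
        point = [a + s * (b - a) for a, b in zip(x1, x2)]
        total += math.exp(-alpha * density(point)) * dt
    return total

# Hypothetical smoothed density: a constant plateau of height 3.
def plateau(point):
    return 3.0

# alpha = 0 ignores the density and recovers the Euclidean length.
euclid = exp_path_length((0.0, 0.0), (3.0, 4.0), plateau, alpha=0.0)
# alpha = 1 shrinks the same segment by the factor exp(-3).
shrunk = exp_path_length((0.0, 0.0), (3.0, 4.0), plateau, alpha=1.0)
```

With $\alpha = 0$ the sketch returns the Euclidean length of the segment, matching the remark above; larger $\alpha$ shortens segments that run through high density regions.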

3.2. The Regression Function. Recall that $f(x) = f_P(x) = E(Y | X = x)$
denotes the regression function. We assume that $X \in [0, 1]^d \equiv \mathcal{X}$ and that
$|Y| \le M$ for some finite constant $M$.^1 We formalize the semisupervised
smoothness assumption by defining the following scale of function spaces.

^1 The results can be extended to unbounded $Y$ with suitable conditions on the tails of
the distribution of $Y$.
Fig 3. With a density-sensitive metric, the points X and Z are closer than the points X
and Y because there is a high density path connecting X and Z.



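Let $\mathcal{F} \equiv \mathcal{F}(P, \alpha, \sigma, L)$ denote the set of functions $f : [0, 1]^d \to \mathbb{R}$ such that, for
all $x_1, x_2 \in \mathcal{X}$,

(8)  $|f(x_1) - f(x_2)| \le L\, D_{P,\alpha,\sigma}(x_1, x_2)$.

Let $\mathcal{P}(\alpha, \sigma, L)$ denote all joint distributions for $(X, Y)$ such that $f_P \in \mathcal{F}(P, \alpha, \sigma, L)$
and such that $P_X$ is supported on $\mathcal{X}$.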

3.3. Properties of the Function Spaces. The variance of our estimator
will depend on

(9)  $\int \frac{dP(x)}{P(B_{P,\alpha,\sigma}(x, \epsilon))}$

where $B_{P,\alpha,\sigma}(x, \epsilon) = \{z : D_{P,\alpha,\sigma}(x, z) \le \epsilon\}$. Let $S_P$ denote the support of $P$
and let $\mathcal{N}_{P,\alpha,\sigma}(\epsilon)$ denote the covering number, the smallest number of balls
of the form $B_{P,\alpha,\sigma}(x, \epsilon)$ required to cover $S_P$. A simple argument shows that

(10)  $\int \frac{dP(x)}{P(B_{P,\alpha,\sigma}(x, \epsilon))} \le \mathcal{N}_{P,\alpha,\sigma}(\epsilon/2)$.

In the Euclidean case $\alpha = 0$, we have $\mathcal{N}_{P,0,\sigma}(\epsilon) \le (C/\epsilon)^d$. But when $\alpha > 0$
and $P$ is concentrated on or near a set of dimension less than $d$, the covering
number $\mathcal{N}_{P,\alpha,\sigma}(\epsilon)$ can be much smaller than $(C/\epsilon)^d$. The next result gives a few examples
showing that concentrated distributions have small covering numbers. We
say that a set $A$ is regular if there is a $C > 0$ such that, for all small $\epsilon > 0$,

(11)  $\sup_{x, y \in A,\ \|x - y\| \le \epsilon} \frac{d_A(x, y)}{\|x - y\|} \le C$.

Recall that $S_P$ denotes the support of $P$.
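Lemma 2 Suppose that $S_P$ is regular.

1. For all $\alpha$, $\sigma$ and $P$, $\mathcal{N}_{P,\alpha,\sigma}(\epsilon) \preceq \epsilon^{-d}$.

2. Suppose that $P$ is a mixture of $k$ point masses $\delta_{x_1}, \ldots, \delta_{x_k}$, where $\delta_x$ is a point mass at $x$. Then, for
any $\alpha \ge 0$ and any $\epsilon > 0$, $\mathcal{N}_{P,\alpha,\sigma}(\epsilon) \le k$.

3. Suppose that $\dim(S_P) = r < d$. Then, $\mathcal{N}_{P,\alpha,\sigma}(\epsilon) \preceq \epsilon^{-r}$.

4. Suppose that $S_P = W \oplus \gamma$ where $\dim(W) = r < d$. Then, for $\epsilon \ge C\gamma$,
$\mathcal{N}_{P,\alpha,\sigma}(\epsilon) \preceq \epsilon^{-r}$.

Proof. (1) The first statement follows since the covering number of $S_P$
is no more than the covering number of $[0, 1]^d$ and on $[0, 1]^d$, $D_{P,\alpha,\sigma}(x, y) \le
\|x - y\|$. Now $[0, 1]^d$ can be covered by $O(\epsilon^{-d})$ Euclidean balls.

(2) The second statement follows since $\{\{x_1\}, \ldots, \{x_k\}\}$ forms an $\epsilon$-covering
for any $\epsilon$.

(3) We have that $D_{P,\alpha,\sigma}(x, y) \le d_{S_P}(x, y)$. Regularity implies that, for small
$d_{S_P}(x, y)$, $D_{P,\alpha,\sigma}(x, y) \le c\|x - y\|$. We can thus cover $S_P$ by $C\epsilon^{-r}$ balls of
size $\epsilon$.

(4) As in (3), cover $W$ with $N = O(\epsilon^{-r})$ balls of $D$ size $\epsilon$. Denote these balls
by $B_1, \ldots, B_N$. Define $C_j = \{x \in S_P : d_{S_P}(x, B_j) \le \gamma\}$. The $C_j$ form a
covering of size $N$ and each $C_j$ has $D_{P,\alpha,\sigma}$ diameter $\max\{\epsilon, \gamma\}$. □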
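
To make the role of covering numbers concrete, the following sketch greedily builds an $\epsilon$-net of a finite sample; the size of the net is an upper bound on the $\epsilon$-covering number of that sample. The function name `greedy_cover_size`, the two toy samples, and the default Euclidean distance are assumptions made for illustration; any metric, including a density-sensitive one, could be passed in as `distance`.

```python
import math

def greedy_cover_size(points, eps, distance=math.dist):
    """Size of a greedily built eps-net: every point ends up within eps
    of some chosen center, so this upper-bounds the covering number."""
    centers = []
    for p in points:
        # Keep p as a new center only if no existing center already covers it.
        if all(distance(p, c) > eps for c in centers):
            centers.append(p)
    return len(centers)

# Hypothetical samples: a one-dimensional segment versus a full 2-D grid.
segment = [(0.1 * k, 0.0) for k in range(11)]
grid = [(0.1 * i, 0.1 * j) for i in range(11) for j in range(11)]

segment_size = greedy_cover_size(segment, eps=0.25)
grid_size = greedy_cover_size(grid, eps=0.25)
```

The sample concentrated on a segment needs far fewer balls than the sample spread over the square, which is the qualitative content of parts 1 and 3 of the lemma.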



4. Estimating Density-Sensitive Metrics. In this section we consider
estimating the density-sensitive metrics.

4.1. Estimating The Density. Let $m = N - n$ denote the number of
unlabeled points and let

(12)  $\hat p_\sigma(x) = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{\sigma^d} K\left(\frac{\|x - X_{i+n}\|}{\sigma}\right)$

be the usual kernel estimator of $p_\sigma$.
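
A minimal sketch of such a kernel estimator is given below. The specific kernel is an assumption chosen for illustration: the radial "cone" kernel $K(u) = c_d (1 - \|u\|)_+$, which has compact support, is Lipschitz, and is maximized at 0. The names `cone_kernel` and `kde` are hypothetical.

```python
import math

def cone_kernel(r, d):
    """Radial kernel K(u) = c_d * (1 - ||u||)_+ on R^d, evaluated at r = ||u||.
    The constant c_d = (d + 1) / vol(unit ball) makes K integrate to one."""
    unit_ball_volume = math.pi ** (d / 2) / math.gamma(d / 2 + 1)
    c_d = (d + 1) / unit_ball_volume
    return c_d * max(0.0, 1.0 - r)

def kde(x, sample, sigma):
    """Kernel density estimate at x from the (unlabeled) sample with bandwidth sigma."""
    d = len(x)
    total = sum(cone_kernel(math.dist(x, xi) / sigma, d) for xi in sample)
    return total / (len(sample) * sigma ** d)

# One-dimensional toy check: a single sample point at the origin.
density_at_center = kde((0.0,), [(0.0,)], 0.5)
density_outside = kde((1.0,), [(0.0,)], 0.5)
```

Because the kernel has compact support, the estimate is exactly zero at any point farther than $\sigma$ from every sample point.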

To estimate the distance, we need to bound $\|\hat p_\sigma - p_\sigma\|_\infty$ uniformly over
a range of values of $\sigma$. The uniform in bandwidth result by Einmahl and
Mason (2005) provides almost sure bounds of this type. For example, their
result implies that, almost surely, there is an $m_0$ such that for all $m \ge m_0$
and all $c > 0$, and for all $(c \log m / m)^{1/d} \le \sigma \le 1$,

$$\|\hat p_\sigma - p_\sigma\|_\infty \le \sqrt{\frac{K(c)\,\left(\log(1/\sigma) \vee \log\log m\right)}{m \sigma^d}}.$$

However, such a bound is not uniform over a class of distributions. Instead
we use the following result whose proof is in Section 11.
Theorem 4.1 Let $X_1, \ldots, X_m \sim P$ where $P$ has support on a compact set
$\mathcal{X} \subset \mathbb{R}^d$. Let $\hat p_\sigma(x) = \frac{1}{m} \sum_i \frac{1}{\sigma^d} K(\|x - X_i\|/\sigma)$. Let $p_\sigma(x) = E(\hat p_\sigma(x))$. Suppose
that $K(x) \le K(0)$ for all $x$ and that

$$|K(y) - K(x)| \le L \|x - y\|$$

for all $x, y$. Let $0 < a \le A < \infty$. If $\epsilon \le 2/3$ then

where $\mathcal{P}$ is the set of distributions such that $P_X$ is supported on $\mathcal{X}$.



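Thus, for large $m$, the bound is at most $e^{-c m \epsilon^2 a^d}$. Let $a_m \ge (1/m)^{1/(d(1+\gamma))}$ where $\gamma$ is
any small, positive number. Then, with probability at least $1 - 1/m$,

(15)  $\sup_{a_m \le \sigma \le A} \|\hat p_\sigma - p_\sigma\|_\infty \le \sqrt{\frac{C \log m}{m\, a_m^d}}$.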



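4.2. Estimating the Exponential Distance. Define

(16)  $\hat D_{\alpha,\sigma}(x_1, x_2) = \inf_{\gamma \in \Gamma(x_1, x_2)} \int_0^{L(\gamma)} \exp\left[-\alpha\, \hat p_\sigma(\gamma(t))\right] dt$.

Lemma 3 Suppose that $\|\hat p_\sigma - p_\sigma\|_\infty \le \epsilon_m$. Then, for all $x_1, x_2$,

(17)  $e^{-\alpha \epsilon_m} D_{P,\alpha,\sigma}(x_1, x_2) \le \hat D_{\alpha,\sigma}(x_1, x_2) \le e^{\alpha \epsilon_m} D_{P,\alpha,\sigma}(x_1, x_2)$.

Proof. Follows easily from the definition of the metric. □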
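
For completeness, the step behind this short proof can be written out as follows. This is a sketch of the standard argument under the lemma's assumption, using only the definitions of the exponential metric and its plug-in estimate.

```latex
% For any path \gamma \in \Gamma(x_1, x_2), any t, and \alpha \ge 0:
\begin{align*}
|\hat p_\sigma(\gamma(t)) - p_\sigma(\gamma(t))| \le \epsilon_m
&\;\Longrightarrow\;
e^{-\alpha \epsilon_m}\, e^{-\alpha p_\sigma(\gamma(t))}
\le e^{-\alpha \hat p_\sigma(\gamma(t))}
\le e^{\alpha \epsilon_m}\, e^{-\alpha p_\sigma(\gamma(t))} \\
&\;\Longrightarrow\;
e^{-\alpha \epsilon_m} \int_0^{L(\gamma)} e^{-\alpha p_\sigma(\gamma(t))}\, dt
\le \int_0^{L(\gamma)} e^{-\alpha \hat p_\sigma(\gamma(t))}\, dt
\le e^{\alpha \epsilon_m} \int_0^{L(\gamma)} e^{-\alpha p_\sigma(\gamma(t))}\, dt .
\end{align*}
% Taking the infimum over \gamma on all three sides preserves the
% inequalities and yields the two-sided bound of the lemma.
```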

4.3. A Computable Estimator. Although the above estimator $\hat D_{\alpha,\sigma}$ is
consistent, it is not easily computable because it involves searching over all
possible paths connecting each pair of points. In fact, even if $P$ is known,
$D_{P,\alpha,\sigma}$ is not computable. A computable estimator for the exponential distance
was proposed by Sajama and Orlitsky (2005) but this estimator is
only consistent for $\alpha = 1$. Bijral, Ratliff and Srebro (2011) presented a computable
estimator but it is not consistent. Here we give an algorithm that
approximates $\hat D_{\alpha,\sigma}$ and is consistent, uniformly over a range of values of $\alpha$.



Remark: In this section we assume that $P_X$ is known and we show how to
approximate $D_{P,\alpha,\sigma}$. When $P_X$ is unknown, simply substitute $\hat p_{X,\sigma}$ for $p_{X,\sigma}$.
The bounds then get multiplied by a factor of $e^{\alpha \epsilon_m}$.



In what follows, "path" always means piecewise differentiable, continuous,
finite length curve with unit speed where differentiable. Define $K'_{\max} =
\sup_u \|\nabla K(u)\|$ and $K_{\max} = \sup_u K(u) = K(0)$, and suppose that $K$ is supported
on the unit ball. Let $\sigma_{\max} \ge \sigma_{\min} > 0$ and $\alpha_{\max} > 0$. Let $\mathcal{X}^*$ be the
convex hull of $\mathcal{X} \oplus \sigma_{\max}$. Let

$$C = \{u_1, \ldots, u_J\}$$

be a Euclidean $\zeta$-covering of $\mathcal{X}^*$. (The cover can, but need not, include the
observed data.)

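For any $0 \le \alpha \le \alpha_{\max}$ and $\sigma_{\min} \le \sigma \le \sigma_{\max}$, define the graph $\hat G_{\alpha,\sigma} =
(V, E, \hat W_{\alpha,\sigma})$ where $V = \{v_1, \ldots, v_J\}$, $(v_i, v_j) \in E$ iff $\|u_i - u_j\| \le \xi$, and for
$i, j$ s.t. $(v_i, v_j) \in E$ define the edge weight

$$\hat W^{i,j}_{\alpha,\sigma} = \|u_i - u_j\| \exp\left[-\alpha\, p_{X,\sigma}\left(\frac{u_i + u_j}{2}\right)\right].$$

Note that each node $v_j$ corresponds to one point $u_j$ in the cover. Also define
$\tilde G_{\alpha,\sigma} = (V, E, \tilde W_{\alpha,\sigma})$ where $\tilde W^{i,j}_{\alpha,\sigma} = D_{P,\alpha,\sigma}(u_i, u_j)$ for $i, j$ s.t. $(v_i, v_j) \in E$.

For any $0 \le \alpha \le \alpha_{\max}$ and $\sigma_{\min} \le \sigma \le \sigma_{\max}$, for $i, j \in \{1, \ldots, J\}$,
define the estimated distance $\hat D_{P,\alpha,\sigma}(u_i, u_j)$ and the intermediate distance
$\tilde D_{P,\alpha,\sigma}(u_i, u_j)$ to be the graph (i.e. shortest-path) distances between vertices
$v_i$ and $v_j$ on $\hat G_{\alpha,\sigma}$ and $\tilde G_{\alpha,\sigma}$, resp. Note that the distances are only defined
for points in $\{u_j\}_{j=1}^{J}$. Let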



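$$\Lambda = \frac{\alpha K'_{\max}}{\sigma^{d+1}} \exp\left[\frac{\alpha K_{\max}}{\sigma^{d}}\right] \le \frac{\alpha_{\max} K'_{\max}}{\sigma_{\min}^{d+1}} \exp\left[\frac{\alpha_{\max} K_{\max}}{\sigma_{\min}^{d}}\right].$$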



Theorem 4.2 If $\zeta \le 1/(32\Lambda)$ then, for any $0 \le \alpha \le \alpha_{\max}$ and $\sigma_{\min} \le \sigma \le
\sigma_{\max}$ and for any $i, j \in \{1, \ldots, J\}$,

(18)  $\frac{1}{8} D_{P,\alpha,\sigma}(u_i, u_j) \le \hat D_{P,\alpha,\sigma}(u_i, u_j) \le 8\, D_{P,\alpha,\sigma}(u_i, u_j)$.

Note that this bounds $\hat D$, which is sufficient for our purposes. It is possible
to modify the proof to show that, in fact, $\hat D$ consistently estimates $D$.



Proof. Let ^ = 4(^. By Proposition 2 in Section 11 we have that for i,j 
s.t. {vi,Vj) G E 

(1 - Ae)H^^;i < Wt^^ < (1 + Ae/2)i^^',i. 
So for all i,j £ {1, J}, 

Da,a,p{Ui,Uj) < Da,a,p{Ui,Uj) < ^ ^ Da,a,p{Ui,Uj). 



1 + Ae/2 



Clearly for all i,j £ {1,...,J}, Dp^a,cT{ui,Uj) < Da^fj^p{ui,Uj). Also for 
i,j such that \\ui — Uj\\ < ^ Dp^a,a{ui,Uj) = Da^fj^p{ui,Uj). Suppose i,j 
such that \\ui — Uj\\ > ^. Let 7 be the path such that Dp^a,aiui,Uj) = 

f exp [—apx,a{'y{t))] dt. Of course, ^(7) > Divide 7 into a sequence of 


paths 70,71: •••,7Q such that 70(0) = m, 7Q{L{-fQ)) = uj, L{jo) G (0,^-2C], 
and for k £ {1, ...,Q} L(7fc) = ^ - 2C and 7fc_i(L(7fc_i)) = 7a:(0). Clearly 
Q > 1. For k G {1, Q} let = 7a;(0), let rg+i = Uj, and let nfc such that 
W^k — Un,.\\ < C- By Proposition 1 in Section 11, 



k=l 

Q 

< Dp^oi,a{ui,ri) + ^ {2Dp^a,a{rk,Un^) + -Dp,a,CT(''fc, ''fc+l)) 

fc=l 

^ V Dp^a,a{rk,rk+l, 

Q r y 

< Dp^a,a (Uj , ri ) + ^ Dp^a,a (rfc , rfc+i ) 1 + 2C ( A + 



fc=l 



1 + AC 
C-2C 



< 



1 + 2C A + 



Dp,a,a{Ui,Uj) 



V e-2C, 

and the result follows since ^ = and ( < 7/(32A). □ 



Remark: For any points $x, y$ not in the cover $C$, we can define $\hat D(x, y) =
\hat D(u_i, u_j)$ where $u_i$ is the closest point in $C$ to $x$ and $u_j$ is the closest point
in $C$ to $y$.

A summary of the algorithm is given in Figure 4.

1. Construct a Euclidean $\zeta$-covering $\{u_j\}_{j=1}^{J}$ of the convex hull of $\mathcal{X} \oplus \sigma$.

2. Construct a graph with the covering points as nodes, and edges between pairs
of points closer than $\xi$ for some $\xi > 2\zeta$.

3. Set the edge weight between connected neighbors $i$ and $j$ to
$\|u_i - u_j\| \exp\left[-\alpha\, p_{X,\sigma}\left(\frac{u_i + u_j}{2}\right)\right]$.

4. Approximate the $D_{P,\alpha,\sigma}$-distance between any two points as the graph
(i.e. shortest path) distance between the corresponding nearest neighbors in
the cover.

Fig 4. Computing Density-Sensitive Metrics
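
The four steps can be sketched in code as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the cover is passed in as a list of points, `density` stands for the smoothed density (or its estimate), shortest paths are computed with Dijkstra's algorithm, and the function names `graph_distance` and `nearest_index` are hypothetical. Neighbors are found by a plain scan, which is quadratic in the cover size.

```python
import heapq
import math

def nearest_index(cover, x):
    """Index of the cover point closest to x in Euclidean distance (step 4)."""
    return min(range(len(cover)), key=lambda i: math.dist(cover[i], x))

def graph_distance(cover, density, alpha, xi, source, target):
    """Shortest-path distance between cover[source] and cover[target] on the
    graph whose edges join cover points at Euclidean distance <= xi (step 2),
    with midpoint-rule edge weights (step 3)."""
    def weight(i, j):
        mid = [(a + b) / 2 for a, b in zip(cover[i], cover[j])]
        return math.dist(cover[i], cover[j]) * math.exp(-alpha * density(mid))

    best = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        dist_u, u = heapq.heappop(heap)
        if u == target:
            return dist_u
        if dist_u > best.get(u, math.inf):
            continue  # stale heap entry
        for v in range(len(cover)):
            if v != u and math.dist(cover[u], cover[v]) <= xi:
                cand = dist_u + weight(u, v)
                if cand < best.get(v, math.inf):
                    best[v] = cand
                    heapq.heappush(heap, (cand, v))
    return math.inf  # target not reachable with this edge radius

# Toy cover: eleven points on a segment, joined only to adjacent points.
cover = [(0.1 * k, 0.0) for k in range(11)]

def flat(point):
    return 1.0  # hypothetical constant density

d_euclid = graph_distance(cover, flat, alpha=0.0, xi=0.15, source=0, target=10)
d_dense = graph_distance(cover, flat, alpha=2.0, xi=0.15, source=0, target=10)
start = nearest_index(cover, (0.52, 0.3))
```

In practice one would restrict the neighbor scan with a spatial index; the quadratic scan is kept here only to keep the sketch short.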



To further speed up the algorithm, we have found the following heuristic
to be useful. We approximate the edge weight between all pairs of points $i$
and $j$ by

(19)  $\|X_i - X_j\| \exp\left[\,\cdots\,\right]$  if $X_j$ is a $k$-nearest neighbor of $X_i$,

      $\|X_i - X_j\|$  otherwise,

where $k$ is an integer and $\hat p_i$ is the $k$-NN density estimate at the $i$th point.



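5. Density-Sensitive Inference. We consider the following semisupervised
learner which uses a kernel that is sensitive to the density. Let $Q$
be a kernel and let $Q_h(x) = h^{-d} Q(x/h)$. Let

(20)  $\hat f_{h,\alpha,\sigma}(x) = \frac{\sum_{i=1}^{n} Y_i\, Q_h\left(\hat D_{\alpha,\sigma}(x, X_i)\right)}{\sum_{i=1}^{n} Q_h\left(\hat D_{\alpha,\sigma}(x, X_i)\right)}$.

In the following we take, for simplicity, $Q(x) = I(\|x\| \le 1)$. Now we give an
upper bound on the risk of $\hat f_{h,\alpha,\sigma}$.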
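
With the boxcar kernel, the estimator reduces to a local average of the labels whose covariates lie within estimated distance $h$ of the query point. The sketch below illustrates this; the name `ss_kernel_estimate`, the convention of returning 0 when no labeled point is within range, and the use of Euclidean distance in the toy call are assumptions made for illustration. Any estimated density-sensitive distance can be supplied as `distance`.

```python
import math

def ss_kernel_estimate(x, labeled, distance, h):
    """Boxcar-kernel estimate: mean of Y_i over labeled X_i with distance(x, X_i) <= h.
    The factor h^{-d} in Q_h cancels between numerator and denominator."""
    neighbors = [y for xi, y in labeled if distance(x, xi) <= h]
    if not neighbors:
        return 0.0  # convention 0/0 = 0 (an assumption of this sketch)
    return sum(neighbors) / len(neighbors)

# Toy labeled data in one dimension, queried with plain Euclidean distance.
labeled = [((0.0,), 1.0), ((0.1,), 3.0), ((1.0,), 10.0)]
estimate = ss_kernel_estimate((0.05,), labeled, math.dist, h=0.2)
empty = ss_kernel_estimate((5.0,), labeled, math.dist, h=0.2)
```

The unlabeled data enter only through `distance`: replacing Euclidean distance by an estimated density-sensitive distance changes which labeled points count as neighbors.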
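Theorem 5.1 Suppose that $|Y| \le M$. Define the event $\mathcal{G}_m = \{\|\hat p_\sigma -
p_\sigma\|_\infty \le \epsilon_m\}$ (which depends on the unlabeled data) and suppose that $P(\mathcal{G}_m^c) \le
1/m$. Then, for every $P \in \mathcal{P}(\alpha, \sigma, L)$,

(21)  $R_P(\hat f_{h,\alpha,\sigma}) \le L^2 (h e^{\alpha \epsilon_m})^2 + \frac{M^2 (2 + 1/e)\, \mathcal{N}(P, \alpha, \sigma, e^{-\alpha \epsilon_m} h/2)}{n} + \frac{4M^2}{m}$.

Proof. The risk is

$$R_P(\hat f) = E_{n,N}\left[(1 - \mathcal{G}_m) \int (\hat f_{h,\alpha,\sigma}(x) - f(x))^2 dP(x)\right] + E_{n,N}\left[\mathcal{G}_m \int (\hat f_{h,\alpha,\sigma}(x) - f(x))^2 dP(x)\right],$$

where $\mathcal{G}_m$ inside the expectations denotes the indicator of the event. Since $|Y| \le M$ and $\sup_x |f(x)| \le M$,

$$E_{n,N}\left[(1 - \mathcal{G}_m) \int (\hat f_{h,\alpha,\sigma}(x) - f(x))^2 dP(x)\right] \le 4M^2 P(\mathcal{G}_m^c) \le \frac{4M^2}{m}.$$

Now we bound the second term.

Condition on the unlabeled data. Replacing Euclidean distance with $\hat D_{\alpha,\sigma}$
in the proof of Theorem 5.2 in Gyorfi et al. (2002), we have that

$$E_n\left[\int (\hat f_{h,\alpha,\sigma}(x) - f(x))^2 dP(x)\right] \le L^2 R^2 + M^2 \left(2 + \frac{1}{e}\right) \int \frac{dP(x)}{n\, P(\hat B_{\alpha,\sigma}(x, h))}$$

where

$$R = \sup\left\{ D_{P,\alpha,\sigma}(x_1, x_2) : (x_1, x_2) \text{ such that } \hat D_{\alpha,\sigma}(x_1, x_2) \le h \right\}$$

and $\hat B_{\alpha,\sigma}(x, h) = \{z : \hat D_{\alpha,\sigma}(x, z) \le h\}$. On the event $\mathcal{G}_m$ we have from
Lemma 3 that $e^{-\alpha \epsilon_m} D_{P,\alpha,\sigma}(x_1, x_2) \le \hat D_{\alpha,\sigma}(x_1, x_2) \le e^{\alpha \epsilon_m} D_{P,\alpha,\sigma}(x_1, x_2)$ for all
$x_1, x_2$. Hence, $R^2 \le e^{2\alpha \epsilon_m} h^2$ and

$$\int \frac{dP(x)}{P(\hat B_{\alpha,\sigma}(x, h))} \le \int \frac{dP(x)}{P(B_{P,\alpha,\sigma}(x, e^{-\alpha \epsilon_m} h))}.$$

A simple covering argument (see p 76 of Gyorfi et al) shows that, for any
$\delta > 0$,

$$\int \frac{dP(x)}{P(B_{P,\alpha,\sigma}(x, \delta))} \le \mathcal{N}(P, \alpha, \sigma, \delta/2).$$

The result follows. □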
Corollary 5.2 If $\mathcal{N}(P, \alpha, \sigma, \delta) \le (C/\delta)^{\xi}$ for $\delta \ge (1/2)\, e^{-\alpha \epsilon_m} (n e^{\alpha \epsilon_m (2 - \xi)})^{-1/(2+\xi)}$

and $N \ge 2n$ then

(22) Rp{fa,a,h) < e"^-^'^^) 



2at„ 



1 

"2+? 



1 /C 



n \ h 



+ 



8M2 



TO 



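Hence, if $m \ge n^{2/(2+\xi)}$, and $h \asymp (n e^{\alpha \epsilon_m (2 - \xi)})^{-1/(2+\xi)}$ then

(23)  $\sup_{P \in \mathcal{P}(\alpha, \sigma, L)} R_P(\hat f_{h,\alpha,\sigma}) \preceq \left(\frac{1}{n}\right)^{2/(2+\xi)}$.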



6. Minimax Bounds. To characterize when semisupervised methods
outperform supervised methods, we show that there is a class of distributions
$\mathcal{P}_n$ (which we allow to change with $n$) such that $R_{SS}$ is much smaller than
$R_S$, where

$$R_S = \inf_{\hat f \in \mathcal{S}_n} \sup_{P \in \mathcal{P}_n} R_P(\hat f) \quad \text{and} \quad R_{SS} = \inf_{\hat f \in \mathcal{SS}_N} \sup_{P \in \mathcal{P}_n} R_P(\hat f).$$

To do so, it suffices to find a lower bound on $R_S$ and an upper bound on $R_{SS}$.
Intuitively, $\mathcal{P}_n$ should be a set of distributions whose $X$-marginals are highly
concentrated on or near lower-dimensional sets, since this is where semisupervised
methods deliver improved performance. Indeed, as we mentioned
earlier, for very smooth distributions $P_X$ we do not expect semisupervised
learners to offer much improvement.

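6.1. The Class $\mathcal{P}_n$. Here we define the class $\mathcal{P}_n$. Let $N = N(n)$ and
$m = m(n) = N - n$ and define

(24)  $\epsilon_m \equiv \epsilon(m, \sigma) = \sqrt{\frac{C \log m}{m \sigma^d}}$.

Let $\xi \in [0, d - 3)$, $\gamma > 0$ and define

(25)  $\mathcal{P}_n = \bigcup_{(\alpha, \sigma) \in A_n \times \Sigma_n} \mathcal{Q}(\alpha, \sigma, L)$

where $\mathcal{Q}(\alpha, \sigma, L) \subset \mathcal{P}(\alpha, \sigma, L)$ and $A_n \times \Sigma_n \subset [0, \infty]^2$ satisfy the following
conditions:

(C2)  $\mathcal{Q}(\alpha, \sigma, L) = \left\{ P \in \mathcal{P}(\alpha, \sigma, L) : \mathcal{N}(P, \alpha, \sigma, \epsilon) \le (C/\epsilon)^{\xi} \text{ for all } \epsilon \ge n^{-1/(2+\xi)} \right\}$

(C3)  $\alpha \le \frac{\log 2}{\epsilon(m, \sigma)}$

where $C_0$ is the diameter of the support of $K$.
Here are some remarks about $\mathcal{P}_n$:

1. (C3) implies that $e^{\alpha \epsilon_m} \le 2$ and hence that $(1/2) D_{P,\alpha,\sigma}(x_1, x_2) \le
\hat D_{\alpha,\sigma}(x_1, x_2) \le 2 D_{P,\alpha,\sigma}(x_1, x_2)$ with high probability.

2. (C4) implies that $m \ge (1/\sigma)^{d(1+\gamma)}$ for each $\sigma \in \Sigma_n$. Hence, from the
discussion following Theorem 4.1,

$$\sup_{P \in \mathcal{P}_n} P^m\left( \sup_{\sigma \in \Sigma_n} \|\hat p_\sigma - p_\sigma\|_\infty > \epsilon(m, \sigma) \right) \le \frac{1}{m}$$

and thus Theorem 5.1 and Corollary 5.2 apply.

3. The constraint in (C2) on $\mathcal{N}(\epsilon)$ holds whenever $P$ is concentrated on
or near a set of dimension less than $d$ and $\alpha/\sigma^d$ is large. The constraint
does not need to hold for arbitrarily small $\epsilon$.

4. Some papers on semisupervised learning simply assume that $N = \infty$
since in practice $N$ is usually very large compared to $n$. In that case,
there is no upper bound on $\alpha$ and no lower bound on $\sigma$.

The class $\mathcal{P}_n$ may seem complicated. This is because showing conditions
where semisupervised learning provably outperforms supervised learning is
subtle. Intuitively, the class $\mathcal{P}_n$ is simply the set of highly concentrated distributions
with $\alpha/\sigma$ large.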




6.2. Supervised Lower Bound.
Theorem 6.1 There exists $C > 0$ such that

(26)  $R_S = \inf_{\hat f \in \mathcal{S}_n} \sup_{P \in \mathcal{P}_n} R_P(\hat f) \ge \frac{C}{n^{2/(d-1)}}$.


AZIZYAN ET AL 



Fig 5. The extended tendrils used in the proof of the lower bound, in the special case where
$d = 2$. Each tendril has length $1 - \epsilon$ and joins up with either the top $A_1$ or bottom $A_0$ but
not both.

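Proof. Let $A_1$ and $A_0$ be the top and bottom of the cube $\mathcal{X}$:

$$A_1 = \{(x_1, \ldots, x_{d-1}, 1) : 0 \le x_1, \ldots, x_{d-1} \le 1\}$$

$$A_0 = \{(x_1, \ldots, x_{d-1}, 0) : 0 \le x_1, \ldots, x_{d-1} \le 1\}.$$

Fix $\epsilon = n^{-1/(d-1)}$. Let $q = (1/\epsilon)^{d-1} \asymp n$. For any integers $s = (s_1, \ldots, s_{d-1}) \in
\mathbb{N}^{d-1}$ with $0 \le s_i \le 1/\epsilon$, define the tendril

$$\{(s_1 \epsilon, s_2 \epsilon, \ldots, s_{d-1} \epsilon, x_d) : \epsilon \le x_d \le 1 - \epsilon\}.$$

There are $q = (1/\epsilon)^{d-1} \approx n$ such tendrils. Let us label the tendrils as
$T_1, \ldots, T_q$. Note that the tendrils do not quite join up with $A_0$ or $A_1$.
Let

$$C = A_0 \cup A_1 \cup \left( \bigcup_{j=1}^{q} T_j \right).$$

Define a measure $\mu$ on $C$ as follows:

$$\mu = \frac{1}{4} \mu_0 + \frac{1}{4} \mu_1 + \frac{1}{2q(1 - 2\epsilon)} \sum_{j=1}^{q} \nu_j$$

where $\mu_0$ is $(d-1)$-dimensional Lebesgue measure on $A_0$, $\mu_1$ is $(d-1)$-dimensional
Lebesgue measure on $A_1$ and $\nu_j$ is one-dimensional Lebesgue
measure on $T_j$. Thus, $\mu$ is a probability measure and $\mu(C) = 1$.

Now we define extended tendrils that are joined to the top or bottom of
the cube (but not both). See Figure 5. If

$$T_j = \{(s_1 \epsilon, s_2 \epsilon, \ldots, s_{d-1} \epsilon, x_d) : \epsilon \le x_d \le 1 - \epsilon\}$$

is a tendril, define its extensions

$$T_{j,0} = \{(s_1 \epsilon, s_2 \epsilon, \ldots, s_{d-1} \epsilon, x_d) : 0 \le x_d \le 1 - \epsilon\}$$

$$T_{j,1} = \{(s_1 \epsilon, s_2 \epsilon, \ldots, s_{d-1} \epsilon, x_d) : \epsilon \le x_d \le 1\}.$$

Given $\omega \in \Omega = \{0, 1\}^q$ let

$$S_\omega = A_0 \cup A_1 \cup \left( \bigcup_{j=1}^{q} T_{j,\omega_j} \right)$$

and

$$P_{\omega,X} = \frac{1}{4} \mu_0 + \frac{1}{4} \mu_1 + \frac{1}{2q(1 - \epsilon)} \sum_{j=1}^{q} \nu_{j,\omega_j}$$

where $\nu_{j,\omega_j}$ is one-dimensional Lebesgue measure on $T_{j,\omega_j}$. This $P_{\omega,X}$ is a
probability measure supported on $S_\omega$.

Notice that $S_\omega$ consists of two connected components, namely,

$$U_\omega^{(1)} = A_1 \cup \left( \bigcup_{j : \omega_j = 1} T_{j,1} \right) \quad \text{and} \quad U_\omega^{(0)} = A_0 \cup \left( \bigcup_{j : \omega_j = 0} T_{j,0} \right).$$

Let

$$f_\omega(x) = \frac{L \epsilon}{8}\, I\left(x \in U_\omega^{(1)}\right).$$

Finally, we define $P_\omega = P_{\omega,X} \times P_{\omega,Y|X}$ where $P_{\omega,Y|X}$ is a point mass at
$f_\omega(X)$. Define $d^2(f, g) = \int (f(x) - g(x))^2 d\mu(x)$.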
We complete the proof with a series of claims. 



Claim 1: For each $\omega \in \Omega$, $P_\omega \in \mathcal{P}_n$.

Proof: Since the definition of the $\{P_\omega\}$ does not depend on (C1), (C3)
or (C4), we may simply choose $\alpha$ and $\sigma$ to satisfy these three constraints.
We must then verify (C2). If $x$ and $y$ are in the same connected component
then $|f_\omega(x) - f_\omega(y)| = 0$. Now let $x$ and $y$ be in different components, i.e.
$x \in U_\omega^{(1)}$, $y \in U_\omega^{(0)}$. Let us choose $x$ and $y$ as close as possible in Euclidean
distance; hence $\|x - y\| = \epsilon$. Let $\gamma$ be any path connecting $x$ to $y$. Since
$x$ and $y$ lie on different components, there exists a subset $\gamma_0$ of $\gamma$ of length
at least $\epsilon$ on which $P_\omega$ puts zero mass. By assumption (C4), $\sigma \le \epsilon/(4C_0)$
and hence $P_{X,\sigma}$ puts zero mass on the portion of $\gamma_0$ that is at least $C_0 \sigma$
away from the support of $P_\omega$. This has length at least $\epsilon - 2C_0 \sigma \ge \epsilon/2$. Since
$p_{X,\sigma}(x) = 0$ on a portion of $\gamma_0$,

$$D_{P,\alpha,\sigma}(x, y) \ge \frac{\epsilon}{2} = \frac{\|x - y\|}{2}.$$

Hence, $\|x - y\| \le 2 D_{P,\alpha,\sigma}(x, y)$. Then

$$\frac{|f_\omega(x) - f_\omega(y)|}{D_{P,\alpha,\sigma}(x, y)} \le \frac{2\, |f_\omega(x) - f_\omega(y)|}{\|x - y\|}$$

and the latter is maximized by finding two points $x$ and $y$ as close together
as possible with nonzero numerator. In this case, $\|x - y\| = \epsilon$ and $|f_\omega(x) - f_\omega(y)| = L\epsilon/8$.
Hence, $|f_\omega(x) - f_\omega(y)| \le L\, D_{P,\alpha,\sigma}(x, y)$ as required. Now we show that each
$P = P_\omega$ satisfies

$$\mathcal{N}(P, \alpha, \sigma, \epsilon) \le \left(\frac{C}{\epsilon}\right)^{\xi}$$

for all $\epsilon \ge n^{-1/(2+\xi)}$. Cover the top $A_1$ and bottom $A_0$ of the cubes with
Euclidean spheres of radius 6. There are 0{{l/6)'^~^) such spheres. The 
Dp,a,a radius of each sphere is at most Se~°'^^^^^'^'^ . Thus, these form an 
e covering as long as (5e~°^(°)/^'' < e. Thus the covering number of the 
top and bottom is at most 2{l/S)'^-^ < 2(l/(e°^(°)/'^''e))''-^ Now cover 
the tendris with one-dimensional segments of length 6. The Dp^a,a radius 
of each segment is at most 5e~"/'^''. Thus, these form an e covering as long 
as 5e~"^(°)/'^ < e. Thus the covering number of the tendrils is at most 
q/S = n/S < n/(ee"^^°^/'^ ). Thus we can cover the support with 

balls of size e. (C2) then implies that N{e) < (1/e)* for e > n 2+c as 
required. 

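Claim 2: For any $\omega$, and any $g \ge 0$, $\int g(x)\, dP_\omega(x) \ge \frac{1}{2} \int g(x)\, d\mu(x)$.

Proof: We have

$$\int_{S_\omega} g\, dP_\omega \ge \int_C g\, dP_\omega = \frac{1}{4} \int_{A_0} g\, d\mu_0 + \frac{1}{4} \int_{A_1} g\, d\mu_1 + \frac{\sum_j \int_{T_j} g\, d\nu_j}{2q(1 - \epsilon)}$$

$$= \frac{1}{4} \int_{A_0} g\, d\mu_0 + \frac{1}{4} \int_{A_1} g\, d\mu_1 + \frac{\sum_j \int_{T_j} g\, d\nu_j}{2q(1 - 2\epsilon)} \times \frac{1 - 2\epsilon}{1 - \epsilon}$$

$$\ge \frac{1}{2} \left( \frac{1}{4} \int_{A_0} g\, d\mu_0 + \frac{1}{4} \int_{A_1} g\, d\mu_1 + \frac{\sum_j \int_{T_j} g\, d\nu_j}{2q(1 - 2\epsilon)} \right) = \frac{1}{2} \int g\, d\mu,$$

where the last inequality uses $(1 - 2\epsilon)/(1 - \epsilon) \ge 1/2$, which holds for $\epsilon \le 1/3$.

Claim 3: For any $\omega, \nu \in \Omega$,

$$d^2(f_\omega, f_\nu) = \frac{\rho(\omega, \nu)\, (L/8)^2\, \epsilon^2 (1 - 2\epsilon)}{2q(1 - 2\epsilon)}.$$

Proof: This follows from direct calculation.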



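Claim 4: If $\rho(\omega, \nu) = 1$ then $\|P_\omega \wedge P_\nu\| \ge 1/(16e)$.

Proof: Suppose that $\rho(\omega, \nu) = 1$. $P_\omega$ and $P_\nu$ are the same everywhere
except $T_{j,0} \cup T_{j,1}$, where $j$ is the index where $\omega$ and $\nu$ differ (assume $\omega_j = 0$
and $\nu_j = 1$). Define $A = T_{j,0} \times \{0\}$ and $B = T_{j,1} \times \{L\epsilon/8\}$. Note that $A \cap B = \emptyset$.
So,

$$P_\omega(T_{j,0} \cup T_{j,1}) = P_\omega(A) = P_\nu(T_{j,0} \cup T_{j,1}) = P_\nu(B) = \frac{1 - \epsilon}{2q(1 - \epsilon)}$$

and

$$TV(P_\omega, P_\nu) = |P_\omega(A) - P_\nu(A)| = |P_\omega(B) - P_\nu(B)| = \frac{1 - \epsilon}{2q(1 - \epsilon)} = \frac{1}{2q} = \frac{\epsilon^{d-1}}{2}.$$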

Thus, 

lie A P;|| > ^ (1 - TV(P,, > ^ (l - e'^-V2)'" . 

]_ 

Since $\epsilon = n^{-1/(d-1)}$, this implies that $\|P_\omega \wedge P_\nu\| \ge 1/(16e)$
for all large $n$.



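Completion of the proof. Recall that $\epsilon = n^{-1/(d-1)}$. Combining Assouad's
Lemma with the above claims, we have

$$R_S = \inf_{\hat f \in \mathcal{S}_n} \sup_{P \in \mathcal{P}_n} R_P(\hat f) \ge \inf_{\hat f} \sup_{P \in \{P_\omega\}} R_P(\hat f) \ge \frac{1}{2} \inf_{\hat f} \max_{\omega} E_\omega\left[d^2(f_\omega, \hat f)\right]$$

$$\ge \frac{q}{16} \cdot \frac{(L/8)^2 \epsilon^2 (1 - 2\epsilon)}{2q(1 - 2\epsilon)} \cdot \frac{1}{16e} \asymp \frac{q \epsilon^2 (1 - 2\epsilon)}{2q(1 - 2\epsilon)} = \frac{\epsilon^2}{2} = \frac{1}{2}\, n^{-2/(d-1)}.$$

□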

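6.3. Semisupervised Upper Bound. Now we state the upper bound for
this class.

Theorem 6.2 Let $h = (n e^{\alpha \epsilon_m (2 - \xi)})^{-1/(2+\xi)}$. Then

(27)  $\sup_{P \in \mathcal{P}_n} R_P(\hat f_{h,\alpha,\sigma}) \le \left(\frac{C}{n}\right)^{2/(2+\xi)}$.

Proof. This follows from (C2), (C3) and Corollary 5.2. □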

6.4. Comparison of Lower and Upper Bound. Combining the last two
theorems we have:

Corollary 6.3 Under the conditions of the previous theorem, and assuming
that $d > \xi + 3$,

(28)  $\frac{R_{SS}}{R_S} \preceq \left(\frac{1}{n}\right)^{\frac{2(d - 3 - \xi)}{(2+\xi)(d-1)}} \to 0$

as $n \to \infty$.

This establishes the effectiveness of semisupervised inference in the minimax
sense.

7. The Reciprocal Distance. In this section we consider a second 
density-sensitive metric, called the reciprocal distance. This distance is 
more difficult to implement but it provides a more dramatic distinction 
between supervised and semisupervised methods. 






Define the reciprocal distance 

(29) $$D(x_1,x_2) = D_{P,\alpha}(x_1,x_2) = \inf_{\gamma \in \Gamma(x_1,x_2)} \int_0^{L(\gamma)} \frac{1}{p(\gamma(t))^{\alpha}}\,dt.$$



Let $\mathcal{N}(P,\alpha,\epsilon)$ denote the covering number of $S_P$ under this distance. Let 
$\mathcal{F} = \mathcal{F}(P,\alpha,L)$ denote the set of functions $f : [0,1]^d \to \mathbb{R}$ such that, for all 
$x_1, x_2 \in \mathcal{X}$, 



Let $\mathcal{P}(\alpha, L)$ denote all joint distributions for $(X, Y)$ such that $f_P \in \mathcal{F}(P,\alpha,L)$ 
and such that $P_X$ is supported on $\mathcal{X}$. The rest of the section shows that 
there is a class $\mathcal{P}_n$ where semisupervised inference provably outperforms 
supervised inference under the reciprocal distance. 
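
As a rough illustration of the reciprocal distance (the infimum over paths of the integral of $1/p^\alpha$), the path integral can be approximated on a finite sample by a shortest path in a neighbourhood graph. The sketch below is an illustration under stated assumptions, not the estimator analysed in this paper: the function name `reciprocal_graph_distance`, the symmetric k-nearest-neighbour graph, the trapezoid rule for edge costs, and the requirement that density values be supplied by the caller are all hypothetical choices.

```python
import heapq
import math


def reciprocal_graph_distance(points, density, alpha, k, source, target):
    """Graph approximation of inf over paths of the integral of 1 / p^alpha.

    points  : list of coordinate tuples
    density : list of positive density values, one per point (assumed given)
    alpha   : density-sensitivity exponent
    k       : neighbours per point in the graph (hypothetical tuning choice)
    """
    n = len(points)

    def euclid(i, j):
        return math.dist(points[i], points[j])

    # Build a symmetric k-nearest-neighbour graph.
    neighbours = [set() for _ in range(n)]
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: euclid(i, j))
        for j in order[:k]:
            neighbours[i].add(j)
            neighbours[j].add(i)

    def edge_cost(i, j):
        # Trapezoid rule for the integral of 1 / p^alpha along the segment.
        inv_i = density[i] ** (-alpha)
        inv_j = density[j] ** (-alpha)
        return euclid(i, j) * 0.5 * (inv_i + inv_j)

    # Dijkstra's algorithm from source to target.
    best = [math.inf] * n
    best[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        dist, i = heapq.heappop(heap)
        if dist > best[i]:
            continue
        if i == target:
            return dist
        for j in neighbours[i]:
            candidate = dist + edge_cost(i, j)
            if candidate < best[j]:
                best[j] = candidate
                heapq.heappush(heap, (candidate, j))
    # No connecting path in the graph: mirror the convention D = infinity.
    return math.inf
```

With $\alpha = 0$ the edge costs reduce to Euclidean lengths, so the function returns an ordinary graph geodesic; larger $\alpha$ makes paths through high-density regions cheaper.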

7.1. The Class $\mathcal{P}_n$. The condition number $\tau(S)$ of a set $S$ with boundary 
$\partial S$ is the largest real number $\tau > 0$ such that, if $d(x, \partial S) < \tau$ then $x$ has a 
unique projection onto the boundary of $S$. Here, $d(x,\partial S) = \inf_{z \in \partial S} \|x - z\|$. 
When $\tau$ is large, $S$ cannot be too thin, the boundaries of $S$ cannot be too 
curved and $S$ cannot get too close to being self-intersecting. If $S$ consists 
of more than one connected component, then $\tau$ large also means that the 
connected components cannot be too close to each other. 
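
Two elementary cases may help fix ideas. They follow directly from the definition of the condition number given above and are offered only as illustrations; they are not taken from the paper.

```latex
% Example 1: a closed ball S = \{x : \|x\| \le r\}.
% Every x \ne 0 has the unique projection r x / \|x\| onto \partial S,
% while the centre, at distance r from \partial S, is equidistant from
% every boundary point. Hence
\tau(S) = r .

% Example 2: S is the union of two disjoint closed balls of radius r whose
% boundaries are separated by a gap g < 2r. The midpoint of the gap lies at
% distance g/2 from both spheres and so has two nearest boundary points.
% Hence
\tau(S) = \min\{r,\; g/2\} = g/2 .
```

The second case shows how a small gap between components forces a small condition number, even when each component on its own is well conditioned.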

Let = ci(logm)"^/2 and 5m = 2c2\/(i ((log^ m)/m) ^ . Let W(i^, A,r„) 
denote all distributions P such that t(Sp) > Tn, the number of connected 
components of 5p is at most K and 



As before, let N = N{n) and m = m{n) = N — n. Also, let rj > 0. Define 



(30) 



\f{xi) - f{x2)\ < L Dp^a{xi,X2). 



1 < A < inf p{x) < sup p{x) < A < oo. 



(31) 




^71 
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where An C [0, oo] satisfies the following conditions: 

(CI) T„^n-('^-i). 

(C2) e^<min{A/2,A[2i/"-l]}. 




(C5) m ^ . 

(C6) p is Holder(?7) smooth over its support. 

7.2. Supervised Lower Bound. 
Theorem 7.1 Assume $d \ge 2$. Then, there exists $C > 0$ such that 

$$\inf_{\hat f} \sup_{P \in \mathcal{P}_n} E_n \int (\hat f(x) - f(x))^2\,dP(x) \ge C.$$

First we provide some intuition regarding the proof strategy. We construct 
a set of joint distributions over $X$ and $Y$ that depends on $n$, and apply 
Assouad's Lemma. Intuitively, we need to take advantage of the decreasing 
condition number $\tau_n$. This is because if $\tau_n$ were to be kept fixed, then as $n$ 
increases the semi-supervised assumption would reduce to familiar Euclidean 
smoothness. 

We construct the distributions as follows. We split the unit cube in $\mathbb{R}^d$ 
into two rectangle sets with a small gap in between, and let the marginal 
density $p$ be uniform over these sets. Then we add a series of "bumps" 
between the two rectangles, as shown schematically in Figure 6. Over one of 
the sets we set $f = M$, and over the other we set $f = -M$. The number of 
bumps increases with $n$, implying that the condition number must decrease. 
The sets are designed specifically so that the condition number can be lower 
bounded easily as a function of $n$. In essence, as $n$ increases these boundaries 
become space-filling, so that there is a region where the regression function 
could be $M$ or $-M$, and it is not possible to tell which with only labeled 
data. 

Proof. Step 1: Constructing the hypercube. Let I = [con^^^'^~^^ 
with Co > 1 a constant, q = l'^^^, = {0, 1}'' and e = j^. For i G {1, I}, 
let ai = For {1, /j'^-i, let t;^ = (o^^, a^^_^). Define g : R"^-^ 
Fig 6. A two-dimensional cross-section of the support of a marginal density p used in the 
proof of Theorem 7.1. 



M as g{x) = 
(32) g{x) = { 



r — \ r 



,2 _ (I 



|x||2)^ for 2 ~ — ll^lls < 5 







o.w. 



for X G M*^ ^, where r G (0,1/4) will be specified later (see Equation 33). 
Let 

B = {{x, Xd) G [-0.5, O.S]'^-! X [0, l\:xd< g{x)}. 
For Tg let 

= {(5?, Xd) G M'^-i X R : ((S - v.)/e, Xd - 1/8) G B] 

and 



B. = {(S, Xd) G M'^-^ X M : ((J - v.)/e, Xd - (1/8 + r)) G 5}. 



Let 



1 



S = {x^W^:3x' = {x',x'a) G [e, 1 - e]'^"^ x [e, - - e] s.t. \\x - x'^ < e} 

8 

and 

5 = {x G M'^ : 3x' = {x,x'd) G [e, l-e]'^"^ x [^+r+e, 1-e] s.t. ||x-x ||2 < e}. 






For any T C {1, l^-^ let = S U yij Bj^j and Sr = 5\ |^ U %j • 

Let r be an arbitrary ordering of {1, Given a; G Jl, let r(a;) = {Ti : 
oji = 1}, and let S"" = 5r(^), ~^ = St(^^), and S"^ = S"^ U S^. 

Let p^{x) = r(.x) = MIs^{x) - Mlg^ix), and let P^,^ be a 

point mass at f^{x). Finally, let denote the measure on W^^^ defined by 
the X marginal p^{x) and the conditional distributionand Py\x- 



Step 2: Lower Bound. Note that Leh{B^) = Leh{B^), and so for any 

uj,uj', Leh{S'^) = Leb(5^') = Leb(5)+Leb(^). Let A = l/(Leb(5)+Leb(5)), 
i.e. A = l/Leb(S'^) for any oo. Let oo,oo' & U such that p{uj,u}') = 1 (where 
p denotes the Hamming distance), and without loss of generality assume 
cji = and uj[ = \. Also denote z = Fj. Then the Li distance between P^ 
and P^' is 



diiP'^,P'^ ) 



\p'^ix)dP^,x-p^ ix)dP^,^\dx 



dx + 



+ 



// 



ar-Y\x ^^Y\x 



dx 



BrTiB^ 



AP^I_^(ix + 



B^\B^ 



= + A Leh{B.\B.) + A Leh{B.\B.) + 2A Leb(S^ n B^) 
= A(Leb(Sy) + Leb(S^)) = 2Ae'^-^ Leb(S) 

where in the first step we have used the fact that x ^ U S^' =^ p^ (x) = 
p'^ [x) = 0, and divided 5'^ U S'^ into four non-intersecting components. 
Then we can bound the affinity of the product measures and for 

p{lo,lo') = 1 as 



l^'n AP„^||>-(l-dl(P^P'^)/2) 



2n 



(1 - Ae*^-^ Leb(B)) 



2n 
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For any uj ^ cj', we have, for arbitrary j € {1, 0"'^^! 



p{oj, oj'){M^ Leh{BjABj) + AM^ Leh{Bj n Bj)) 
2p{Lo, a;')M2(Leb(5j) + Leb(Sj n Bj)) 



= 2p{uj,uj')M^e''-'{Leh{B) +Leh{Br)) 

where the sum is only over indices where lo and co' differ and B^ = {x E B : 
X — (0, ...,0,r) G B}. Then, by Assouad's lemma, 



inf maxE^[d2(/'^, f)] > (Leb(S) + Leb(S^))(l - Ae*^"^ Leb(B))". 



Also we have 
1 
A 



1 1 (r (^) - r'{x)fp'^{x)dx = I (r (x) - r'{x)fdx 
= j{nx)-r'{x)fdx- j {r{x)-r'{x)fdx 



= cP{r, r') - Leh{S'^'\S'^) > d'^if'^, f^') - M^ge^'-i heh{B\Br) 
= d\r, r') - (^^) ' ' (Leb(B) - Leb(B,)). 
Since A > 1, 

inf sup E„ / (fix) - f{x)fdP{x) > inf maxE^^ / {f{x) - {x)fp'^ {x)dx 
f PeVn J '^^^ J 

> ^^^^!^^'^ \ Leh{B) + Leb(5^))(l - Ae'*"^ Leb(S))" - {lef'^ (Leb(5) - Leb(5^)). 

As soon as n > 2'^, Z > 2 and {elf~^ > Clearly Leb(B) < i. Let 

Co > 3. Then e < 1/8 and A < (1 - 2e)-^'^-^\l - 4e - r)"^ < 2'^+\ so 

(l-A6«'-iLeb(B)r> (^1-^) ^ [e-^'/'o-y. 

So if we let cq > (2'^/ log(5/4)) ^(^-1), then e-^'/^d"' > 4/5 and for suffi- 
ciently large n we will have (1 - Ae'^"^ Leb(S))2" > 8/25. Hence, 

(fix) - f{x)fdP{x) > (Leb(B,) - 50Leb(S\5,)) . 
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Since 

Leh{Br) = 7: U - r ^— — — - and Leh{B\Br) < 



2\2 J r(d/2 + i) ^ ^ '■^ - 2<i-ir((d- l)/2 + 1)^ 

then 

Leb(B^) - 50 Leh{B\Br) > - - - r ^ 



2 V2 y r(d/2 + l) 2'>'-ir((d- l)/2 + l) 



TT 



2dT{^) (d + l) 
Now let r be such that 

(33) ,_100(d+l)r^l 

Then 

inf sup E„ / (/(x) - f{x)fdPix) > , -, 

? PeVn J ^ ^ - 50-4'^r(^^) (d + l) 



Step 3: Verifying condition number: For any oj, 

r(5'^) = min|r(5'^),r(5^),- inf inf \\u - v\U 

12 «eS"^es" 

Due to the shape of the function for arbitrary i G {1, we have 

r(S'^) > mui{T{dS),T{dR^\dS)} 

By definition of 5 it is easy to see that T{dS) = e. Also 

r{dB.\dS) = T{{{x,xd) G [-e/2,e/2]''-i x [0, l]:xd = gix/e)]) 

> eT{{{x, Xd) G [-1/2, 1/2]'^-! X [0, 1] : xa = g{x)}) = er. 

Since r < 1, we have t{S^) > re, and similarly t(S^) > re. Now, 

I 1^1 ^^-l II" - ^Il2 ^ ^ , l^J^ 1 \\iu,giu)) - {v,g{v) + r)||2 
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which is smaller than er, so for n sufficiently large, 

which completes the proof. 



Step 4: Each $P^\omega$ is in $\mathcal{P}_n$. This follows by construction. □ 



7.3. Semisupervised Upper Bound. Define Pm to be the kernel density 
estimator based on the unlabeled data with bandwidth (logm/m)^/(^+'^). 
Let S = {x : Pm{x) > 0}. Recall that = ci(logm)~^/^ and (5^ = 

2c2\/d ((log^ m)/m)^ . Define 

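$$\hat D_\alpha(x_1,x_2) = \inf_{\gamma \in \hat\Gamma(x_1,x_2)} \int_0^{L(\gamma)} \frac{1}{\hat p_m(\gamma(t))^{\alpha}}\,dt$$

where 

$$\hat\Gamma(x_1,x_2) = \{\gamma \in \Gamma(x_1,x_2) : \forall t \in [0,L(\gamma)],\ \gamma(t) \in \hat S \setminus R_{\partial \hat S}\},$$

$$R_{\partial \hat S} = \{x : \inf_{z \in \partial \hat S} \|x - z\|_2 < 2\delta_m\}.$$

We define $\hat D_\alpha(x_1,x_2) = \infty$ if $\hat\Gamma(x_1,x_2) = \emptyset$. Let 

$$\hat f_{h,\alpha}(x) = \frac{\sum_{i=1}^n Y_i\, Q_h(\hat D_\alpha(x,X_i))}{\sum_{i=1}^n Q_h(\hat D_\alpha(x,X_i))}.$$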
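
The estimator defined here is a kernel-weighted average of the labels, with weights that depend on the estimated density-sensitive distance rather than on Euclidean distance. A minimal sketch follows, assuming the distances from the query point to the labeled points have already been computed. The function name `kernel_regression`, the boxcar default kernel, and the convention of returning 0 when every weight vanishes are hypothetical choices made for illustration; reading the garbled display as a ratio of kernel-weighted sums is also an assumption.

```python
import math


def kernel_regression(dist_to_labeled, labels, h, kernel=None):
    """Kernel-weighted average of labels given precomputed distances.

    dist_to_labeled : distances D(x, X_i) from the query x to each labeled X_i;
                      math.inf marks a labeled point with no admissible path
    labels          : responses Y_i, in the same order
    h               : bandwidth
    kernel          : function of u = D / h (boxcar by default, an assumption)
    """
    if kernel is None:
        kernel = lambda u: 1.0 if u <= 1.0 else 0.0

    # Unreachable points (infinite distance) receive zero weight.
    weights = [kernel(d / h) if math.isfinite(d) else 0.0
               for d in dist_to_labeled]
    total = sum(weights)
    if total == 0.0:
        # Convention when no labeled point carries weight (assumption).
        return 0.0
    return sum(w * y for w, y in zip(weights, labels)) / total
```

Because unreachable points get zero weight, labels from a different connected component of the estimated support never influence the prediction, which is the mechanism the semisupervised assumption relies on.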

Theorem 7.2 Let h = ^/lJn. Then, 

sup Rp{fh,a) di -■ 
PeVn n 
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Proof. Let S = Sp. Let Sm = { x ^ S : inf ||x — z\\2 > 3Sm } ■ Then, 

z&dS 

ifhA^)-fix)ydPix) = I {f^^^{x)-f{x)fdP{x)+ j {hAx)-f{x)fdP{x) 
Now 

/ {U{x) - f{x)fdP{x) < 2M'^P{S\Sm) < 2AM^ Leh{S\Sm)- 

J S Sm 

Since the radius of curvature of dS is at least r^, and r„ > 35m j we have by 
Proposition 6, 



Leb(5\S„) < Vol(a5)^^^i±%|^<C3 



< C3VH^<3C32'^^<fi 



from (C4) where Vol denotes the d — 1-dimensional volume on dS. 
Arguing as in Theorem 5.1, we have, for each P £ Vn, that 



M 



IE / - f\x)fdP{x) < {R{h/2)f + 



4+iAA(P,a,|) 



4M2 
H 

c ~" ' ' " " ' n m 

where R = sup{Dp^a{xi,X2)/Da{xi,X2) : xi,X2 G Sm} and A/'(P,a,e) 
denotes the covering number of Sm under Da- In Proposition 5 in Section 
1L3 we show that Dp^aixi, X2) < [(A + em) / Daixi, X2) for xi,X2 € Sm- 
By (C2) this implies that R < 2. We also show in Proposition 5 that 
Daixi,X2) < ds^ixi,X2)/{X - em)" for xi,X2 £ Sm- Here, ds^{xi,X2) is 
the length (in Euclidean distance) of the shortest path in Sm connecting xi 
and X2- Thus 

M{P,a,h/2)<J^m f5(A-e„r 



where Mm denotes the covering number under ds^- (C3) implies that h{X — 
Cm)" > T^*-^"^-*. By Proposition 7, each connected component of Sp may be 
covered by one set and hence J\fm (|(A — Cm)") < K. We thus have that 



r2 

' 5*77 

and the result follows since h = n"^/^ and m > n. □ 




7.4. Comparison of Lower and Upper Bound. Finally we have: 
Corollary 7.3 Under the conditions of the previous theorem, 

<^^) If ^ (^) - » 

as n oo. 



8. Adaptive Semisupervised Inference. We have established a bound 
on the risk of the density-sensitive semisupervised kernel estimator. The 
bound is achieved by using an estimate $\hat D_{\alpha,\sigma}$ of the density-sensitive 
distance. However, this requires knowing the density-sensitive parameter $\alpha$, 
along with other parameters. It is critical to choose $\alpha$ (and $h$) appropriately, 
otherwise we might incur a large error if the semisupervised assumption does 
not hold or holds with a different density sensitivity value $\alpha$. We consider 
two methods for choosing the parameters. 

The following result shows that we can adapt to the correct degree of 
semisupervisedness if cross-validation is used to select the appropriate $\alpha$, $\sigma$ 
and $h$. This implies that the estimator gracefully degrades to a supervised 
learner if the semisupervised assumption (sensitivity of regression function 
to marginal density) does not hold ($\alpha = 0$). 

For any $f$, define the risk $R(f) = E[(f(X) - Y)^2]$ and the excess risk 
$\mathcal{E}(f) = R(f) - R(f^*) = E[(f(X) - f^*(X))^2]$ where $f^*$ is the true regression 
function. Let $\mathcal{H}$ be a finite set of bandwidths, let $\mathcal{A}$ be a finite set of values 
for $\alpha$ and let $\Sigma$ be a finite set of values for $\sigma$. Let $\theta = (h,\alpha,\sigma)$, $\Theta = \mathcal{H} \times \mathcal{A} \times \Sigma$ 
and $J = |\Theta|$. 

Divide the data into training data $T$ and validation data $V$. For 
notational simplicity, let both sets have size $n$. Let $\mathcal{F} = \{\hat f^T_\theta\}_{\theta \in \Theta}$ denote the 
semisupervised kernel estimators trained on data $T$ using $\theta \in \Theta$. For each 
$\hat f^T_\theta \in \mathcal{F}$ let 

$$\hat R^V(\hat f^T_\theta) = \frac{1}{n}\sum_{i=1}^n (\hat f^T_\theta(X_i) - Y_i)^2$$

where the sum is over $V$. Let $Y_i = f^*(X_i) + \epsilon_i$ with $\epsilon_i \overset{iid}{\sim} N(0, \sigma^2)$. Also, we 
assume that $|f^*(x)|, |\hat f^T_\theta(x)| \le M$, where $M > 0$ is a constant. (Note that the 
estimator can always be truncated if necessary.) 
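Theorem 8.1 Let $\mathcal{F} = \{\hat f^T_\theta\}_{\theta \in \Theta}$ denote the semisupervised kernel 
estimators trained on data $T$ using $\theta \in \Theta$. Use validation data $V$ to pick 

$$\hat\theta = \operatorname{argmin}_{\theta \in \Theta} \hat R^V(\hat f^T_\theta)$$

and define the corresponding estimator $\hat f = \hat f_{\hat\theta}$. Then, for every $0 < \delta < 1$, 

(35) $$E[\mathcal{E}(\hat f_{\hat\theta})] \le \frac{1}{1-a}\left[\min_{\theta \in \Theta} E[\mathcal{E}(\hat f_\theta)] + \frac{\log(J/\delta)}{nt}\right] + 4\delta M^2$$

where $0 < a < 1$ and $0 < t < 15/(38(M^2 + \sigma^2))$ are constants. $E$ denotes 
expectation over everything that is random. 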
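
The selection step in Theorem 8.1 is ordinary hold-out validation over a finite grid. A minimal sketch, assuming the candidate estimators have already been fitted on the training half; the function name `select_by_validation` and the dictionary interface are hypothetical and chosen only for illustration.

```python
def select_by_validation(candidates, validation):
    """Pick the parameter whose fitted predictor has the smallest validation risk.

    candidates : dict mapping a parameter tuple theta to a fitted predictor f(x)
    validation : list of (x, y) pairs held out from training
    Returns (best_theta, risks), where risks maps each theta to its empirical
    squared-error risk on the validation data.
    """
    def empirical_risk(f):
        return sum((f(x) - y) ** 2 for x, y in validation) / len(validation)

    risks = {theta: empirical_risk(f) for theta, f in candidates.items()}
    best_theta = min(risks, key=risks.get)
    return best_theta, risks
```

For example, `candidates` could map each tuple `(h, alpha, sigma)` on the grid to a closure that evaluates the corresponding kernel estimator trained on $T$.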

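Proof. First, we derive a general concentration of $\mathcal{E}(f)$ around $\hat{\mathcal{E}}(f)$ where 
$\hat{\mathcal{E}}(f) = \hat R(f) - \hat R(f^*) = -\frac{1}{n}\sum_{i=1}^n U_i$ and $U_i = -(Y_i - f(X_i))^2 + (Y_i - f^*(X_i))^2$. 

If the variables $U_i$ satisfy the following moment condition: 

$$E[|U_i - E[U_i]|^k] \le \frac{\mathrm{Var}(U_i)}{2}\, k!\, r^{k-2}$$

for some $r > 0$, then the Craig-Bernstein (CB) inequality (Craig 1933) states 
that with probability $\ge 1 - \delta$, 

$$\frac{1}{n}\sum_{i=1}^n (U_i - E[U_i]) \le \frac{\log(1/\delta)}{nt} + \frac{t\,\mathrm{Var}(U_i)}{2(1-c)}$$

for $0 \le tr \le c < 1$. The moment conditions are satisfied by bounded random 
variables as well as Gaussian random variables (see e.g. Haupt and Nowak 
(2006)). 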

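To apply this inequality, we first show that $\mathrm{Var}(U_i) \le 4(M^2 + \sigma^2)\mathcal{E}(f)$ 
since $Y_i = f^*(X_i) + \epsilon_i$ with $\epsilon_i \overset{iid}{\sim} N(0, \sigma^2)$. Also, we assume that $|f(x)|$, 
$|f^*(x)| \le M$, where $M > 0$ is a constant. 

$$\mathrm{Var}(U_i) \le E[U_i^2] = E[(-(Y_i - f(X_i))^2 + (Y_i - f^*(X_i))^2)^2]$$
$$= E[(-(f^*(X_i) + \epsilon_i - f(X_i))^2 + \epsilon_i^2)^2]$$
$$= E[(-(f^*(X_i) - f(X_i))^2 - 2\epsilon_i(f^*(X_i) - f(X_i)))^2]$$
$$\le 4M^2\mathcal{E}(f) + 4\sigma^2\mathcal{E}(f) = 4(M^2 + \sigma^2)\mathcal{E}(f).$$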

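Therefore using the CB inequality we get, with probability $\ge 1 - \delta$, 

$$\mathcal{E}(f) - \hat{\mathcal{E}}(f) \le \frac{\log(1/\delta)}{nt} + \frac{2t(M^2 + \sigma^2)\,\mathcal{E}(f)}{1 - c}.$$

Now set $c = tr = 8t(M^2 + \sigma^2)/15$ and let $t < 15/(38(M^2 + \sigma^2))$. With this 
choice, $c < 1$ and define 

$$a = \frac{2t(M^2 + \sigma^2)}{1 - c} < 1.$$

Then, using $a$ and rearranging terms, with probability $\ge 1 - \delta$, 

$$(1 - a)\,\mathcal{E}(f) - \hat{\mathcal{E}}(f) \le \frac{\log(1/\delta)}{nt}$$

where $t < 15/(38(M^2 + \sigma^2))$. 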
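
The role of the restriction on $t$ can be checked numerically. The sketch below assumes the garbled constants read $c = 8t(M^2+\sigma^2)/15$ and $a = 2t(M^2+\sigma^2)/(1-c)$; both readings are reconstructions, and the function name `cb_constants` is hypothetical.

```python
def cb_constants(t, M, sigma):
    """Return (c, a) for the Craig-Bernstein step, under the assumed forms
    c = 8 t (M^2 + sigma^2) / 15 and a = 2 t (M^2 + sigma^2) / (1 - c)."""
    s = M ** 2 + sigma ** 2
    c = 8.0 * t * s / 15.0
    a = 2.0 * t * s / (1.0 - c)
    return c, a


def t_upper_limit(M, sigma):
    """Upper limit 15 / (38 (M^2 + sigma^2)) on t stated in the theorem."""
    return 15.0 / (38.0 * (M ** 2 + sigma ** 2))
```

Under these assumed forms, $a$ increases with $t$ and equals 1 exactly at the upper limit, so any smaller $t$ gives $0 < c < 1$ and $0 < a < 1$.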

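Then, using the previous concentration result, and taking a union bound 
over all $f \in \mathcal{F}$, we have with probability $\ge 1 - \delta$, for all $f \in \mathcal{F}$, 

$$\mathcal{E}(f) \le \frac{1}{1-a}\left[\hat{\mathcal{E}}^V(f) + \frac{\log(J/\delta)}{nt}\right].$$

Now, for any $f \in \mathcal{F}$, 

$$\mathcal{E}(\hat f_{\hat\theta}) = R(\hat f_{\hat\theta}) - R(f^*) \le \frac{1}{1-a}\left[\hat R^V(\hat f_{\hat\theta}) - \hat R^V(f^*) + \frac{\log(J/\delta)}{nt}\right] \le \frac{1}{1-a}\left[\hat R^V(f) - \hat R^V(f^*) + \frac{\log(J/\delta)}{nt}\right],$$

where the last step uses the definition of $\hat\theta$. Taking expectation with respect to the validation dataset, and using that the 
excess risk is at most $4M^2$ on the event of probability at most $\delta$ where the bound fails, 

$$E_V[\mathcal{E}(\hat f_{\hat\theta})] \le \frac{1}{1-a}\left[R(f) - R(f^*) + \frac{\log(J/\delta)}{nt}\right] + 4\delta M^2.$$

Now taking expectation with respect to the training dataset, 

$$E_{TV}[\mathcal{E}(\hat f_{\hat\theta})] \le \frac{1}{1-a}\left[E_T[R(f) - R(f^*)] + \frac{\log(J/\delta)}{nt}\right] + 4\delta M^2.$$

Since this holds for all $f \in \mathcal{F}$, we get: 

$$E_{TV}[\mathcal{E}(\hat f_{\hat\theta})] \le \frac{1}{1-a}\left[\min_{f \in \mathcal{F}} E_T[\mathcal{E}(f)] + \frac{\log(J/\delta)}{nt}\right] + 4\delta M^2.$$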

The result follows. □ 

Fig 7. The swiss roll data set. Point size represents regression function. 

In practice, both may be taken to be of size $n^a$ for some $a > 0$. Then we 
can approximate the optimal $h$, $\alpha$ and $\sigma$ with sufficient accuracy to achieve 
the optimal rate. Setting $\delta = 1/(4M^2 n)$, we then see that the penalty for 
adaptation is $\frac{\log(J/\delta)}{nt} + 4\delta M^2 = O(\log n/n)$ and hence introduces only a 
logarithmic term. 
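
The order of the adaptation penalty can be verified by direct substitution. The short calculation below assumes the penalty has the form $\log(J/\delta)/(nt) + 4\delta M^2$ with $J = n^a$ and $t$ a fixed constant; the symbol $a$ here is the grid-size exponent, not the constant $a$ of Theorem 8.1.

```latex
% Substituting \delta = 1/(4 M^2 n) and J = n^a:
\frac{\log(J/\delta)}{nt} + 4\delta M^2
  = \frac{a \log n + \log(4 M^2 n)}{nt} + \frac{1}{n}
  = \frac{(a+1)\log n + \log(4 M^2)}{nt} + \frac{1}{n}
  = O\!\left(\frac{\log n}{n}\right).
```

The constant factor $1/(1-a)$ multiplying the first term in Theorem 8.1 does not change the order.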



Remark: Cross-validation is not the only way to adapt. For example, the 
adaptive method in Kpotufe (2011) can also be used here. 

9. Simulation Results. In this section we describe the results of a 
series of numerical experiments on a simulated data set to demonstrate the 
effect of using the exponential version of the density sensitive metric for 
small, labeled sample sizes. For the marginal distribution of $X$, we used a 
slightly modified version of the swiss roll distribution used in Culp (2011b). 
Figure 7 shows a sample from this distribution, where the point size 
represents the response $Y$. We repeatedly sampled $N = 400$ points from this 
distribution, and computed the mean squared error of the kernel regression 
estimator using a set of values for $\alpha$ and for labeled sample size ranging from 
$n = 5$ to $n = 320$. We used the approximation (19) with $k = 20$. 

Figure 8 shows the average results after 300 repetitions of this procedure 
with error bars indicating a 95% confidence interval. As expected, we observe 
that for small labeled sample sizes increasing $\alpha$ can decrease the error. But as 
the labeled sample size increases, using the density sensitive metric becomes 
decreasingly beneficial, and can even hurt. 
Fig 8. MSE of kernel regression on the swiss roll data set for a range of labeled sample 
sizes using different values of $\alpha$. 
10. Discussion. Semisupervised methods are very powerful but, like 
all methods, they only work under certain conditions. We have shown that, 
under certain conditions, semisupervised methods provably outperform 
supervised methods. In particular, the advantage of semisupervised methods 
is mainly when the distribution $P_X$ of $X$ is concentrated near a low 
dimensional set rather than when $P_X$ is smooth. 

We introduced a family of estimators indexed by a parameter $\alpha$. This 
parameter controls the strength of the semi-supervised assumption. The 
behavior of the semi-supervised method depends critically on $\alpha$. Finally, we 
showed that cross-validation can be used to automatically adapt to $\alpha$ so 
that $\alpha$ does not need to be known. Hence, our method takes advantage of 
the unlabeled data when the semi-supervised assumption holds, but does 
not add extra bias when the assumption fails. Our simulations confirm that 
our proposed estimator adapts well to $\alpha$ and has good risk both when the 
semi-supervised smoothness holds and when it fails. 

The analysis in this paper can be extended in several ways. First, it is 
possible to use other density sensitive metrics such as the diffusion distance 
(Lee and Wasserman, 2008). Second, we defined a method to estimate the 
density sensitive metric that works under broader conditions than the two 
existing methods due to Sajama and Orlitsky (2005) and Bijral, Ratliff and 
Srebro (2011). We suspect that faster methods can be developed. Finally, 
other estimators besides kernel estimators can be used. We will report on 
these extensions elsewhere. 



11. Additional Proofs. 



11.1. Proof of Theorem 4.1. To prove Theorem 4.1 we use the approach 
in Yukich (1985). (See also Gine and Guillou (2002) and Prakasa-Rao (1983).) 
If $\ell \le u$, define the bracket $[\ell,u] = \{h : \ell \le h \le u\}$. A collection 
$(\ell_1,u_1), \ldots, (\ell_N,u_N)$ is an $\epsilon$ bracketing of a class of functions $\mathcal{F}$ if $\mathcal{F} \subset 
\bigcup_{j=1}^N [\ell_j,u_j]$ and $\int |u_j - \ell_j|^p\,dP \le \epsilon^p$ for $j = 1, \ldots, N$. The bracketing number 
$N_{[\,]}(\epsilon, \mathcal{F}, L_p(P))$ is the size of the smallest $\epsilon$ bracketing. 

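Theorem 11.1 Let $X_1, \ldots, X_n \sim P$. Define $P(f) = \int f(z)\,dP(z)$ and $P_n(f) = 
\frac{1}{n}\sum_{i=1}^n f(X_i)$. Let $A = \sup_f \int |f|\,dP$ and $B = \sup_f \|f\|_\infty$. Then 

$$P^n\left(\sup_{f \in \mathcal{F}} |P_n(f) - P(f)| > \epsilon\right) \le 2N_{[\,]}(\epsilon/8,\mathcal{F},L_1(P)) \exp\left(-\frac{3}{4B}\,\frac{n\epsilon^2}{6A + \epsilon}\right) + 2N_{[\,]}(\epsilon/8,\mathcal{F},L_1(P)) \exp\left(-\frac{3n\epsilon}{64B}\right).$$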
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Hence, if e < 2A/3, 
(36) 

sup \PM) - P{f)\ >e]< 4iVn(e/8, J-,Li(P)) exp 



Proof. (This proof follows Yukich (1985).) For notational simplicity in the 
proof, let us write $N(\epsilon) = N_{[\,]}(\epsilon, \mathcal{F}, L_1(P))$. Define $z_n(f) = \int f\,(dP_n - dP)$. 
Let $[\ell_1,u_1], \ldots, [\ell_N,u_N]$ be a minimal $\epsilon/8$ bracketing. We may assume that 
for each $j$, $\|u_j\|_\infty \le B$ and $\|\ell_j\|_\infty \le B$. (Otherwise, we simply truncate the 
brackets.) For each $j$, choose some $f_j \in [\ell_j,u_j]$. 

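Consider any $f \in \mathcal{F}$ and let $[\ell_j,u_j]$ denote a bracket containing $f$. Then 

$$|z_n(f)| \le |z_n(f_j)| + |z_n(f - f_j)|.$$

Furthermore, 

$$|z_n(f - f_j)| = \left|\int (f - f_j)(dP_n - dP)\right| \le \int |f - f_j|\,(dP_n + dP) \le \int |u_j - \ell_j|\,(dP_n + dP)$$
$$= \int |u_j - \ell_j|\,(dP_n - dP) + 2\int |u_j - \ell_j|\,dP \le z_n(|u_j - \ell_j|) + \frac{\epsilon}{4}.$$

Hence, 

$$|z_n(f)| \le |z_n(f_j)| + \left[z_n(|u_j - \ell_j|) + \frac{\epsilon}{4}\right].$$

Thus, 

$$P^n\left(\sup_{f} |z_n(f)| > \epsilon\right) \le P^n\left(\max_j |z_n(f_j)| > \epsilon/2\right) + P^n\left(\max_j |z_n(|u_j - \ell_j|)| + \epsilon/4 > \epsilon/2\right)$$
$$\le P^n\left(\max_j |z_n(f_j)| > \epsilon/2\right) + P^n\left(\max_j |z_n(|u_j - \ell_j|)| > \epsilon/4\right).$$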



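Now 

$$\mathrm{Var}(f_j) \le \int f_j^2\,dP = \int |f_j|\,|f_j|\,dP \le \|f_j\|_\infty \int |f_j|\,dP \le AB.$$

Hence, by Bernstein's inequality, 

$$P^n\left(\max_j |z_n(f_j)| > \epsilon/2\right) \le 2\sum_{j=1}^N \exp\left(-\frac{3}{4B}\,\frac{n\epsilon^2}{6A + \epsilon}\right) \le 2N(\epsilon/8)\exp\left(-\frac{3}{4B}\,\frac{n\epsilon^2}{6A + \epsilon}\right).$$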
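Similarly, 

$$\mathrm{Var}(|u_j - \ell_j|) \le \int (u_j - \ell_j)^2\,dP \le \int |u_j - \ell_j|\,|u_j - \ell_j|\,dP \le \|u_j - \ell_j\|_\infty \int |u_j - \ell_j|\,dP \le 2B\,\frac{\epsilon}{8} = \frac{B\epsilon}{4}.$$

Also, $\|u_j - \ell_j\|_\infty \le 2B$. Hence, by Bernstein's inequality, 

$$P^n\left(\max_j z_n(|u_j - \ell_j|) > \epsilon/4\right) \le 2\sum_{j=1}^N \exp\left(-\frac{3n\epsilon}{64B}\right) \le 2N(\epsilon/8)\exp\left(-\frac{3n\epsilon}{64B}\right).$$

□ 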

The following result is from Example 19.7 from van der Vaart (1998). 

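Lemma 4 Let $\mathcal{F} = \{f_\theta : \theta \in \Theta\}$ where $\Theta$ is a bounded subset of $\mathbb{R}^d$. 
Suppose there exists a function $m$ such that, for every $\theta_1, \theta_2$, 

$$|f_{\theta_1}(x) - f_{\theta_2}(x)| \le m(x)\,\|\theta_1 - \theta_2\|.$$

Then, 

$$N_{[\,]}(\epsilon, \mathcal{F}, L_q(P)) \le \left(\frac{4\sqrt{d}\,\mathrm{diam}(\Theta) \int |m(x)|^q\,dP(x)}{\epsilon}\right)^d.$$

Proof. Let 

$$\delta = \frac{\epsilon}{4\sqrt{d} \int |m(x)|^q\,dP(x)}.$$

We can cover $\Theta$ with (at most) $N = (\mathrm{diam}(\Theta)/\delta)^d$ cubes $C_1, \ldots, C_N$ of size 
$\delta$. Let $c_1, \ldots, c_N$ denote the centers of the cubes. Note that $C_j \subset B(c_j, \sqrt{d}\,\delta)$ 
where $B(x, r)$ denotes a ball of radius $r$ centered at $x$. Hence, $\bigcup_j B(c_j, \sqrt{d}\,\delta)$ 
covers $\Theta$. Let $\theta_j$ be the projection of $c_j$ onto $\Theta$. Then $\bigcup_j B(\theta_j, 2\delta\sqrt{d})$ covers 
$\Theta$. In summary, for every $\theta \in \Theta$ there is a $\theta_j \in \{\theta_1, \ldots, \theta_N\}$ such that 

$$\|\theta - \theta_j\| \le 2\delta\sqrt{d} = \frac{\epsilon}{2\int |m(x)|^q\,dP(x)}.$$

Define $\ell_j = f_{\theta_j} - \epsilon\, m(x)/(2\int m)$ and $u_j = f_{\theta_j} + \epsilon\, m(x)/(2\int m)$. We claim that 
the brackets $[\ell_1, u_1], \ldots, [\ell_N, u_N]$ cover $\mathcal{F}$. To see this, choose any $f_\theta \in \mathcal{F}$. 
Let $\theta_j$ be the closest element of $\{\theta_1, \ldots, \theta_N\}$ to $\theta$. Then 

$$f_\theta(x) = f_{\theta_j}(x) + f_\theta(x) - f_{\theta_j}(x) \le f_{\theta_j}(x) + |f_\theta(x) - f_{\theta_j}(x)| \le f_{\theta_j}(x) + m(x)\,\|\theta - \theta_j\| \le u_j(x).$$

By a similar argument, $f_\theta(x) \ge \ell_j(x)$. Also, $\int (u_j - \ell_j)^q\,dP \le \epsilon^q$. Finally, 
note that the number of brackets is 

$$N = (\mathrm{diam}(\Theta)/\delta)^d = \left(\frac{4\sqrt{d}\,\mathrm{diam}(\Theta) \int |m(x)|^q\,dP(x)}{\epsilon}\right)^d.$$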

□ 

Now we prove Theorem 4.1. 

Proof. Let 9 = {x,a), O = X x [a, A], fe{u) = a-'^K{\\x - u\\/a) and 
J- = {/g : 6* G e}. We apply Theorem 11.1 with ^ = 1 and 5 = K{Q)/a'^. 
We need to bound Ny^{e, F , Li{P). Let 6 = {x,a'^) and u = {y,T'^). Some 
algebra shows that 

\fe{u)-Uu)\<-^^\\e-v\\. 



Apply Lemma 4 to get 

Ar[](e,^,Li(P))< 
Hence, Theorem 11.1 yields, 
P'^(sup|p„(x)-p,(x)| >e) < 2(^-g-^ 



C 



d+l 



Sne'^a'^ \ ( 3nea'^ \ 

exp I — r + exp 



4i^(0)(6 + e) 7 V 64i^(0) / 

□ 

Note that the proofs of the last two results did not depend on P. Hence, 
the results hold uniformly over P. 
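
For concreteness, the function class used in this proof consists of rescaled kernels centred at a point, and the kernel density estimator is their empirical average. The sketch below assumes the garbled definition reads $f_\theta(u) = \sigma^{-d}K(\|x-u\|/\sigma)$; the Gaussian kernel and the function names `gaussian_kernel` and `kde` are hypothetical choices, since the proof only uses properties such as the bound $K(0)$.

```python
import math


def gaussian_kernel(r, d):
    """Standard d-dimensional Gaussian kernel evaluated at radius r (assumed choice)."""
    return math.exp(-0.5 * r * r) / (2.0 * math.pi) ** (d / 2.0)


def kde(x, sample, sigma):
    """Kernel density estimate at x: the average over the sample of
    sigma^{-d} K(||x - X_i|| / sigma)."""
    d = len(x)
    total = sum(gaussian_kernel(math.dist(x, xi) / sigma, d) for xi in sample)
    return total / (len(sample) * sigma ** d)
```

The factor $\sigma^{-d}$ is why the sup-norm bound $B$ in the proof scales like $K(0)$ divided by the $d$-th power of the smallest bandwidth.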



11.2. Propositions for Section 4-3. 
Proposition 1 Given x,y, z ^ R'^, let 71 be the path from x to y such that 

Dp,a,a{x,y)= J exp [-apx,ahi{t))]dt. Then 




\y - Z\ 



1 + exp 



a" 



aK* 



a' 



d+l 



--{L{ii) + \\y-z\ 
Proof. Let $\gamma_2$ be the straight line from $y$ to $z$ (i.e. $L(\gamma_2) = \|y - z\|$). Then 

Dp,aAy^^)= iiif / exp [-apx,a(7(0)] 
7er(2/,2) J 


< I eyi^[-apx,a{l2{t))\dt < L{'^2) sup eii.^[-apx,a{l2{t))\ 



< H12, 



-^(71) uo 



sup ||7i(*i) - 72(^2) 

tie[0,L(7i)] 
t26[0,L(72)] 



I 



Dp,a,a{x,y) 



+ a sup 

UQ 



exp [-apx,a{'^)] W'^uPxA'^) 



U=Uq 



(L(7i) + L(72)) 



+ a sup 
-^^(71) UO 



U=Uo 



(L(7i) + L(72)) 



^^^^ + ^^max(^(7l) + ^(72)) 



< Dp^aA^^y) 



H12 



1 + exp 



aK„ 



a" 



aK* 



a' 



d+l 



^(L(7i)+L(72)) 



where the second to last inequality is due to the dominated convergence 
theorem (exchanging differentiation and expectation with respect to $P$), 
and Jensen's inequality (the $L_2$ norm is a convex function). □ 

Proposition 2 Given x,y G R'', 



-OipX,a 



x + y 
2 



\\x — y\\ exp 
and 

Dp,a,aix,y) < ||a;-y||exp 



1 — exp 



(7' 



d+l 



-apx,a 



X + y 



1 + exp 



a" 



< Dp^aA^^y) 



Q'-^maxll'^ y\\ 
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Proof. Let 7 be the straight line from x to y. 

Dp,a,a{x,y) < / exp[-apx,a {'y{t))]dt 



< L{-f) sup exp[-apx,<7 (lit))] 

te[0,L(7)] 



< ||x — y\\ yexp 

< \\x — y\\ ^exp 

< ||x — y\\ exp 



-apx,a 
-apx,a 
apx,, 



X + y 
2 

X + y 
2 

X + y 



X — y 
H 5 sup 

'^-^maxll'^ y 



V„ exp [-apx,a{u] 



+ 



1 + exp 



max 
d 



Q^-^maxll''" ^1 



Now the ball $B\left(\frac{x+y}{2}, \|x - y\|\right)$ contains the balls $B_1 = B(x, \|x - y\|/2)$ and 
$B_2 = B(y, \|x - y\|/2)$. The integral over any path $\gamma$ connecting $x$ and $y$ is 
at least as large as the integral over $\gamma \cap B\left(\frac{x+y}{2}, \|x - y\|\right)$. Hence, 



Dp,a,a{x, y) > \\x - y\\ inf exp [-apx,a [u)] 

\u(^B{^,\\x-y\\) 



> ||x — y\\ exp 



-apx,, 



-apx,a 



x + y 
2 

x + y 



Q^-^maxll-^ y\ 



a' 



1 — exp 



d+l 

a' 



max 
d 



'-*^-^maxll'^ y\ 



a' 



d+l 



□ 



11.3. Proofs For Section 7.3. 

Proposition 3 // m > mo, where mo = mo(A,A) is a constant, then for 
all marginal densities p of distributions in Vn, we have with probability > 
1 — 1/m, 

sup \p{x) -pm{x)\ < em and dS C TZqs 

x£S\nss 

where = ci(logm)^"'^/^ for constant ci = ci{K,C2,d,r],A), S = {x : 

P.,.,.) > 0). an^nss = |. : mt ||. - .||, < ..„} K = 2„_V3 

[ zeds J V 

for some constant C2 > 0. 
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Proof. Follows from Theorem 1 in Singh, Nowak and Zhu (2008b) by 
noting that since the density estimate will be a.s. outside the boundary 
region, and we have p > A on S", for sufficiently large m (i.e. small Cm), we 
must have S\JZqs C S" C S" U TZdS- ^ 

Proposition 4 Assume sup \pmix) — p{x)\ < Cm and dS C TZqs- Let 

xeS\Rgs 



Da{xi,X2) = Jnf 

7er(a;i,a;2) 



1 



dt 



and ^ = {(xi, X2) : xi, X2 G S\1Zqs-, r(xi, X2) 7^ 0}. Then for any (xi, X2) G 



A 



A + e, 



Da{xi,X2) < Da{xi,X2) < 



A 



(A — em)+ 



Da{xi,X2) 



Proof. Note that by the triangle inequality, TZqs ^ 'TZas-, so S\R.qs ^ 
S\R.Qs since r„ > 25m for m large enough. We see that if (xi,X2) G ^, 
then X and y must be in the same connected component of S\TIqs, and, 
furthermore, all points along any path in T{xi,X2) must also be in the same 
connected component. For {xi,X2) £ ^, 



Da{xi,X2) = Jnf 



< inf 

7er(xi,x2) 



dt 



-dt 



< sup 



p(7(i))« 

Da{xi,X2) 



sup , ^ / ^^ I 
fe[0,L(7)] VPm(7(i))y 



and 



, ^(2) 

sup 



< su ( P(^) 



< 



A 

(A — em)+ 



So 
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Similarly, 

Da{xi,X2)> inf , , Da{xi,X2)>{— Da{xi,X2). 

zes\nas \P[z) + 

□ 

Proposition 5 With the notation of Proposition 4, for all xi,X2, 

Da{xi,X2) < Daixi,X2). 

Assume sup \Pm{x)—p{x)\ < em anddS C TZos- Then for any {xi,X2) G 

~ , , (^QN-T? (X1,X2) 

^, D^{xi,X2) < -^^^ and 



A" 

A \" 



A + 



Da{xi,X2) < Da{xi,X2) < 



dsmixi,X2) 
(A — em)+ 



where we recall that S„ 



X G 5 : inf \\x — zlU > 35^ ^. T/ius 



Daixi,X2) < 



A 



-Da(xi,X2). 



Proof. Since for any xi and X2, r(a;i, 2:2) C r(3;i, 3:2), clearly I? q,(xi, X2) < 
L>a(xi,X2). If OS" C Tigs, 



Da{xi,X2) 



Jnf 



1 







p{i{t)y 



-dt < 



1 



sup 



p{zY 



i(7) 

inf / dt 

7er(a;i,X2) J 



since, by the triangle inequality, 5m C S\R.qs- Applying Proposition 4, the 
result follows. □ 

Proposition 6 Let $\mathcal{X}$ be a compact subset of $\mathbb{R}^d$, and $T > 0$. Then for any 
$\tau \in (0,T)$, for all sets $S \subset \mathcal{X}$ with condition number at least $\tau$, $\mathrm{Vol}(\partial S) \le 
c_3/\tau$ for some $c_3$ independent of $\tau$, where Vol is the $(d-1)$-dimensional volume. 

Proof. Let $\{z_i\}_{i=1}^N$ be a minimal Euclidean $\tau/2$-covering of $\partial S$, and $B_i = 
\{x : \|x - z_i\|_2 \le \tau/2\}$. Let $T_i$ be the tangent plane to $\partial S$ at $z_i$. Then using 
the argument made in the proof of Lemma 4 in Genovese et al. (2010), 

$$\mathrm{Vol}(B_i \cap \partial S) \le C_1 \mathrm{Vol}(B_i \cap T_i) \le C_2 \tau^{d-1}$$

for some constants $C_1$ and $C_2$ independent of $\tau$. Since $\mathcal{X}$ is compact, 

$$\mathcal{N}(\partial S, \|\cdot\|_2, \tau/2) \le C\tau^{-d}$$

for some constant $C$ depending only on $\mathcal{X}$ and $T$, where $\mathcal{N}$ denotes the 
covering number (note that even though $\partial S$ is a $(d-1)$-dimensional set, we 
can't claim $\mathcal{N}(\partial S, \|\cdot\|_2, \tau) = O(\tau^{-(d-1)})$, since $\partial S$ can become space-filling 
as $\tau \to 0$). So 

$$\mathrm{Vol}(\partial S) \le \sum_{i=1}^N \mathrm{Vol}(B_i \cap \partial S) \le C_2 \tau^{d-1} \mathcal{N}(\partial S, \|\cdot\|_2, \tau/2) \le C_2 C \tau^{-1}$$

and the result follows with $c_3 = C_2 C$. □ 

Proposition 7 Let $\mathcal{X}$ be a compact subset of $\mathbb{R}^d$, and $T > 0$. Then for any 
$\tau \in (0,T)$, for all compact, connected sets $S \subset \mathcal{X}$ with condition number at 
least $\tau$, $\sup_{u,v \in S} d_S(u,v) \le c_4 \tau^{1-d}$ for some $c_4$ independent of $\tau$. 

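Proof. First consider the quantity $\sup_{u,v \in \partial S} d_S(u,v)$. Since $\partial S \subset S$, clearly 

$$\sup_{u,v \in \partial S} d_S(u,v) \le \sup_{u,v \in \partial S} d_{\partial S}(u,v).$$

Since $\partial S$ is closed, there must exist $u^*, v^* \in \partial S$ such that 

$$\sup_{u,v \in \partial S} d_{\partial S}(u,v) = d_{\partial S}(u^*,v^*).$$

Let $\{z_i\}_{i=1}^N$ be a minimal $\tau$-covering of $\partial S$ in the $d_{\partial S}$ metric. Let $\{\tilde z_i\}_{i=1}^{\tilde N} \subset 
\{z_i\}_{i=1}^N$ be such that $d_{\partial S}(u^*, \tilde z_1) \le \tau$, $d_{\partial S}(v^*, \tilde z_{\tilde N}) \le \tau$, and for any $1 \le i \le \tilde N - 1$, 
$d_{\partial S}(\tilde z_i, \tilde z_{i+1}) \le 2\tau$. Then 

$$d_{\partial S}(u^*,v^*) \le d_{\partial S}(u^*, \tilde z_1) + d_{\partial S}(v^*, \tilde z_{\tilde N}) + \sum_{i=1}^{\tilde N - 1} d_{\partial S}(\tilde z_i, \tilde z_{i+1}) \le 2\tau \tilde N.$$

So, $d_{\partial S}(u^*,v^*) \le 2\tau\,\mathcal{N}(\partial S, d_{\partial S}, \tau)$. By Proposition 6.3 in Niyogi, Smale and 
Weinberger (2008) (or see Lemma 3 in Genovese et al. (2010)), if $x, y \in \partial S$ 
are such that $\|x - y\|_2 = a \le \tau/2$, then $d_{\partial S}(x,y) \le \tau - \tau\sqrt{1 - (2a)/\tau}$. In 
particular, if $\|x - y\|_2 \le \tau/2$, then $d_{\partial S}(x,y) \le \tau$. So any Euclidean $\tau/2$-covering 
of $\partial S$ is also a $\tau$-covering in the $d_{\partial S}$ metric. Then we have 

$$\sup_{u,v \in \partial S} d_S(u,v) \le d_{\partial S}(u^*,v^*) \le 2\tau\,\mathcal{N}(\partial S, d_{\partial S}, \tau) \le 2\tau\,\mathcal{N}(\partial S, \|\cdot\|_2, \tau/2) \le C\tau\left(\frac{1}{\tau}\right)^d = C\tau^{1-d}$$

for some constant $C$ depending only on $\mathcal{X}$ and $T$ (note that, as in the proof of 
Proposition 6, even though $\partial S$ is a $(d-1)$-dimensional set, we can't claim $\mathcal{N}(\partial S, \|\cdot\|_2, \tau) = 
O(\tau^{-(d-1)})$, since $\partial S$ can become space-filling as $\tau \to 0$). 

Now let $u^1, v^1 \in S$ be such that $\sup_{u,v \in S} d_S(u,v) = d_S(u^1,v^1)$, which must exist 
since $S$ is compact. Let $u^2, v^2 \in \partial S$ be the (not necessarily unique) projections 
of $u^1$ and $v^1$ onto $\partial S$. Clearly the line segment connecting $u^1$ and $u^2$ 
is fully contained in $S$, and the same applies to $v^1$ and $v^2$. So, 

$$d_S(u^1,v^1) \le d_S(u^1,u^2) + d_S(u^2,v^2) + d_S(v^2,v^1)$$
$$\le \|u^1 - u^2\|_2 + \|v^1 - v^2\|_2 + d_{\partial S}(u^*,v^*)$$
$$\le 2\,\mathrm{diam}(\mathcal{X}) + C\tau^{1-d}$$

and setting $c_4 = 2T^{d-1}\,\mathrm{diam}(\mathcal{X}) + C$, the result follows. □ 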
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